Data clustering - example of banknote authentication

1. Introduction

Machine learning tools play an important role in many different areas of life and can make various processes quicker and more precise than a human could. One example is banknote verification, which aims to detect possible fraud and to maintain credibility and trust in a currency. In recent decades counterfeiting technology has advanced considerably, and as a result the responsible organizations have been forced to look for more effective ways of identifying fake money. The goal of this project is to showcase how to cluster data effectively using different methods and quality assessment measures, and how the result can potentially be used to make accurate predictions. Moreover, this research can help to understand how clustering can divide banknote-related data into groups of similar observations, which could help to categorize banknotes efficiently by their characteristics and, in effect, to single out groups of probable fraud, making the subsequent authentication process easier.

2. Dataset

Criminals have incentives to create counterfeit banknotes, and even with existing official authentication techniques many people have trouble recognizing a forgery. Traditional verification involves checking and comparing the texture, serial numbers and security features of a note, however with technological advancements this approach is often insufficient. More effective and robust techniques are therefore necessary, and the solution often lies in photographs and their analysis with machine learning. The data used in this project comes from the UC Irvine Machine Learning Repository (Lohweg, 2012) and consists of 4 characteristics extracted from images of both real and counterfeit banknotes. In theory, analyzing the selected features and checking whether they fit within expected norms should help to distinguish the notes that were tampered with. The characteristics are the variance, which measures how widely the image values are spread around their mean, the skewness, which shows the lack of symmetry in their distribution, the kurtosis (spelled “curtosis” in the data), which describes how heavy the tails of that distribution are, and the entropy, which refers to the randomness of a picture. According to the repository description, the first three are computed on the wavelet-transformed image. The dataset contains 1372 observations of these 4 variables, which will be used to group the analyzed banknotes based on the similarity of their characteristics.
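
As a rough illustration of what these four descriptors measure, the sketch below computes them for a simulated vector of pixel intensities in base R. It does not reproduce the feature extraction used to build the dataset; the vector `pixels`, the helper names `moment_stats` and `shannon_entropy`, and the 16-bin histogram used for the entropy are all assumptions made for this example.

```r
# Hypothetical pixel intensities; NOT the images behind the banknote dataset
set.seed(42)
pixels <- rnorm(1000, mean = 128, sd = 20)

# Moment-based descriptors of a numeric vector
moment_stats <- function(x) {
  z <- (x - mean(x)) / sqrt(mean((x - mean(x))^2))  # standardised values
  c(variance = var(x),
    skewness = mean(z^3),        # 0 for a symmetric distribution
    kurtosis = mean(z^4) - 3)    # excess kurtosis, 0 for a normal distribution
}

# Shannon entropy (in bits) of a histogram of the values
shannon_entropy <- function(x, bins = 16) {
  p <- table(cut(x, breaks = bins)) / length(x)
  p <- p[p > 0]                  # drop empty bins, since 0 * log(0) is taken as 0
  -sum(p * log2(p))
}

round(c(moment_stats(pixels), entropy = shannon_entropy(pixels)), 3)
```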

library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(ClusterR)

load("data.RData")
summary(data)

##     variance          skewness          curtosis          entropy       
##  Min.   :-7.0421   Min.   :-13.773   Min.   :-5.2861   Min.   :-8.5482  
##  1st Qu.:-1.7730   1st Qu.: -1.708   1st Qu.:-1.5750   1st Qu.:-2.4135  
##  Median : 0.4962   Median :  2.320   Median : 0.6166   Median :-0.5867  
##  Mean   : 0.4337   Mean   :  1.922   Mean   : 1.3976   Mean   :-1.1917  
##  3rd Qu.: 2.8215   3rd Qu.:  6.815   3rd Qu.: 3.1793   3rd Qu.: 0.3948  
##  Max.   : 6.8248   Max.   : 12.952   Max.   :17.9274   Max.   : 2.4495

All of the selected variables show quite wide ranges, however considering their characteristics the scales are broadly comparable and the data seems well balanced. The following analysis will use the variance, skewness, kurtosis and entropy of the banknote pictures in order to, most importantly, show the process of effective clustering, and then hopefully group suspicious observations together to improve the later authentication process.

3. Clustering through K-means, PAM, Clara

3.1 Methodology

Clustering is an unsupervised machine learning technique that aims to group observations together based on their characteristics. The main goal of the process is to obtain a predetermined number of clusters in which the similarity of points within a cluster is maximized, while the dissimilarity between separate groups is maximized as well. The dissimilarity is calculated from a distance matrix, which can be built with different distance metrics; the most popular include the Euclidean, Manhattan, Minkowski and Canberra distances. There are also many clustering methods available, which differ in how the groups are formed. For example, when the data shows some sort of nested structure, hierarchical clustering might be the most effective solution, however in the current analysis that does not seem to hold true. Consequently, the hierarchical method will not be one of the selected options; the research will instead focus on other popular clustering techniques - K-means, PAM and Clara, which will be explored in detail.
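
To make the four metrics concrete, the short sketch below evaluates them for two made-up points with the base R function `dist()`. The points and the Minkowski power `p = 3` are arbitrary choices for illustration.

```r
# Two illustrative points in the plane (made-up values)
pts <- rbind(a = c(1, 2), b = c(4, 6))

# dist() from base R supports all four metrics mentioned above
d <- c(
  euclidean = as.numeric(dist(pts, method = "euclidean")),         # sqrt(3^2 + 4^2)
  manhattan = as.numeric(dist(pts, method = "manhattan")),         # 3 + 4
  minkowski = as.numeric(dist(pts, method = "minkowski", p = 3)),  # (3^3 + 4^3)^(1/3)
  canberra  = as.numeric(dist(pts, method = "canberra"))           # 3/5 + 4/8
)
round(d, 3)
```

With `p = 1` the Minkowski distance reduces to the Manhattan distance and with `p = 2` to the Euclidean distance.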

3.2 K-means

The K-means clustering method derives its name from grouping observations around k centroids, each of which is the mean of all the points in its cluster. The number of clusters k is fixed in advance. The algorithm then alternates between assigning every observation to its nearest centroid and recomputing the centroids, and it stops when no observation changes its assignment any more. K-means is a simple and effective way to handle even big datasets, however the algorithm is sensitive to outliers, which heavily affect the calculated mean points of each cluster (centroids) and consequently the whole grouping result.
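
The assignment and update steps can be sketched in a few lines of base R. This is a minimal illustration on simulated data, not the implementation behind `kmeans()` or `eclust()`; the function name `simple_kmeans` and the random initialisation from data rows are assumptions of the example.

```r
# Minimal k-means (Lloyd's algorithm) on simulated 2-D data
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2))

simple_kmeans <- function(x, k, max_iter = 100) {
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random initial centroids
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # squared Euclidean distance of every point to every centroid (n x k matrix)
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d2, ties.method = "first")  # index of nearest centroid
    if (all(new_cluster == cluster)) break              # assignments are stable
    cluster <- new_cluster
    for (j in seq_len(k)) {
      # move each centroid to the mean of its points (skip empty clusters)
      if (any(cluster == j)) centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

fit <- simple_kmeans(x, k = 2)
table(fit$cluster)
```

Because the centroids are plain means, a single extreme observation can shift a centroid noticeably, which is the outlier sensitivity mentioned above.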

Using K-means requires specifying the desired number of clusters, as well as the distance metric used in the calculations. To show how different clusterings might look on the analysed data, 3 metrics are named in the calls below: the Euclidean distance, which is the straight-line distance between two points in Euclidean space, the Manhattan distance, which is the sum of absolute differences of the coordinates of two points, and the Minkowski distance, which generalizes both of the previous metrics. Note that, according to the factoextra documentation, the hc_metric argument of eclust is only used by hierarchical methods, so the three K-means fits most likely all rely on the Euclidean distance and differ only in the number of clusters. For the graphical representation of the results, the 4 variables are projected onto 2 principal component dimensions. The following outcomes show the groupings for 3, 4 and 5 clusters. Results can be analyzed in terms of both the visualizations and the center values of all the created clusters.

km1<-eclust(data, "kmeans", hc_metric="euclidean",k=3, graph=FALSE)
km2<-eclust(data, "kmeans", hc_metric="manhattan",k=4, graph=FALSE)
km3<-eclust(data,"kmeans", hc_metric="minkowski", k=5, graph=FALSE)

fviz_cluster(km1, main="k-means, clusters: 3, metric: euclidean")

km1$centers

##     variance   skewness  curtosis    entropy
## 1  0.7443709  7.3275653 -1.584962 -2.8769837
## 2  0.9869375  0.0997388  1.572734  0.2278310
## 3 -1.9193186 -7.7398343  8.998761 -0.5048393

fviz_cluster(km2, main="k-means, clusters: 4, metric: manhattan")

km2$centers

##    variance   skewness   curtosis    entropy
## 1  2.918668  7.9527259 -2.1913144 -1.9260765
## 2  1.247086 -0.2220863  1.9061592  0.3664473
## 3 -1.993524 -7.8154507  9.0472086 -0.5185061
## 4 -1.864731  5.7478975 -0.8322658 -3.4747764

fviz_cluster(km3, main="k-means, clusters: 5, metric: minkowski")

km3$centers

##     variance   skewness   curtosis    entropy
## 1  2.7134244  7.6674147 -2.1451028 -2.0125342
## 2  2.1024673 -0.9852707  3.1128382  0.5121493
## 3 -2.1423634 -7.9950470  9.2238364 -0.5828171
## 4 -0.9213938  1.9423493 -0.7362124 -0.9324843
## 5 -2.9689451  9.2667083  0.4097818 -5.3370660

The initial visualizations seem to indicate that the data can in fact be clustered, and the created groups are indeed characterized by different levels of variance, skewness, kurtosis and entropy, however the best number of clusters is not immediately clear. All of the groupings look reasonable according to both their visualizations and their silhouettes shown below.

sil1<-silhouette(km1$cluster, dist(data))
sil2<-silhouette(km2$cluster, dist(data))
sil3<-silhouette(km3$cluster, dist(data))

fviz_silhouette(sil1)

##   cluster size ave.sil.width
## 1       1  577          0.34
## 2       2  582          0.36
## 3       3  213          0.49

fviz_silhouette(sil2)

##   cluster size ave.sil.width
## 1       1  324          0.42
## 2       2  524          0.35
## 3       3  209          0.48
## 4       4  315          0.15

fviz_silhouette(sil3)

##   cluster size ave.sil.width
## 1       1  360          0.35
## 2       2  344          0.27
## 3       3  198          0.46
## 4       4  349          0.32
## 5       5  121          0.36

Silhouettes are one way of measuring the quality of a clustering, specifically how well each observation fits the cluster it was assigned to. The silhouette width of an observation ranges from -1 to 1. Negative values are the most undesirable result, as they indicate that a point would fit better in a different group. In all of the K-means versions above, however, the below-zero parts of the silhouettes seem to be minimal; the weakest group is cluster 4 of the 4-cluster solution, with an average width of 0.15. In general, choosing the optimal number of clusters is a complex issue and will be discussed in more detail later in the analysis.
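
The silhouette width of a single observation can also be computed by hand, which may help when reading the plots. The sketch below does so for six made-up one-dimensional observations and compares the result with `cluster::silhouette()`; the data and the variable names are assumptions of the example.

```r
library(cluster)  # silhouette()

# Six made-up 1-D observations forming two obvious groups
x  <- c(1, 2, 3, 10, 11, 12)
cl <- c(1, 1, 1, 2, 2, 2)
d  <- as.matrix(dist(x))

# s(i) = (b - a) / max(a, b), where a is the mean distance to the own cluster
# and b is the mean distance to the nearest other cluster
sil_by_hand <- sapply(seq_along(x), function(i) {
  own <- setdiff(which(cl == cl[i]), i)
  a   <- mean(d[i, own])
  b   <- min(sapply(setdiff(unique(cl), cl[i]), function(g) mean(d[i, cl == g])))
  (b - a) / max(a, b)
})

# cluster::silhouette() should give the same widths
sil_pkg <- silhouette(cl, dist(x))[, "sil_width"]
round(cbind(by_hand = sil_by_hand, package = sil_pkg), 3)
```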

3.3 PAM

PAM (Partitioning Around Medoids) is another approach to clustering, quite similar to K-means but with one important distinction. Whereas in K-means each cluster is built around the computed mean of its points, in PAM clusters are centered around actual observations that exist within the dataset. Instead of creating a new average reference point for each group, PAM therefore clusters around medoids - actual data points. As a result, this method is much less sensitive to outliers, as even the most deviating observations have less impact on existing medoids than on newly calculated mean centroids. However, due to these algorithmic changes PAM has a much higher computational complexity, making it a less efficient solution for big datasets. For the analyzed data, the partitioning results for 3, 4 and 5 clusters are quite similar to those obtained with K-means.
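
The difference between a centroid and a medoid is easiest to see on a tiny example with one extreme value. The five numbers below are made up for illustration.

```r
# Made-up 1-D cluster with one extreme value
x <- c(1, 2, 3, 4, 100)

centroid <- mean(x)                    # pulled towards the outlier
d        <- as.matrix(dist(x))
medoid   <- x[which.min(rowSums(d))]   # observation with the smallest total distance

c(centroid = centroid, medoid = medoid)   # centroid = 22, medoid = 3
```

The centroid ends up far from the bulk of the points, whereas the medoid stays at an actual, typical observation.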

pam1<-pam(data,3)
fviz_cluster(pam1, geom="point", ellipse.type="convex", main="PAM, clusters: 3")

fviz_silhouette(pam1, main="PAM silhouette, clusters: 3")

##   cluster size ave.sil.width
## 1       1  512          0.35
## 2       2  618          0.34
## 3       3  242          0.45

pam2<-pam(data,4)
fviz_cluster(pam2, geom="point", ellipse.type="convex", main="PAM, clusters: 4")

fviz_silhouette(pam2, main="PAM silhouette, clusters: 4")

##   cluster size ave.sil.width
## 1       1  485          0.31
## 2       2  294          0.32
## 3       3  241          0.41
## 4       4  352          0.34

pam3<-pam(data,5)
fviz_cluster(pam3, geom="point", ellipse.type="convex", main="PAM, clusters: 5")

fviz_silhouette(pam3, main="PAM silhouette, clusters: 5")

##   cluster size ave.sil.width
## 1       1  405          0.33
## 2       2  218          0.27
## 3       3  284          0.30
## 4       4  193          0.37
## 5       5  272          0.24

3.4 Clara

The last of the considered clustering techniques is Clara (Clustering Large Applications), an extension of the PAM method designed to address its high computational complexity. Like PAM, Clara builds its groups around medoids, however instead of working on the entire dataset at once, it repeatedly applies PAM to smaller samples and keeps the best set of medoids found. In effect, the results are obtained quickly and remain robust to outliers, however there may be some loss of precision because the medoids are chosen from samples rather than from the full data.
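
The sampling behaviour is controlled directly in `cluster::clara()` through the `samples` and `sampsize` arguments. The sketch below uses simulated data and arbitrary values for both arguments, so it only illustrates the mechanics.

```r
library(cluster)  # clara()

# Simulated data: three groups of 500 points each (not the banknote data)
set.seed(123)
x <- rbind(matrix(rnorm(1000, mean = 0),  ncol = 2),
           matrix(rnorm(1000, mean = 6),  ncol = 2),
           matrix(rnorm(1000, mean = 12), ncol = 2))

# 'samples' sub-datasets of 'sampsize' rows each are clustered with PAM;
# the medoid set that fits the full data best is kept
fit <- clara(x, k = 3, samples = 20, sampsize = 100)

fit$medoids            # medoids are rows of x
table(fit$clustering)
```

Larger values of `samples` and `sampsize` bring the result closer to a full PAM run at the cost of more computation.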

cl1<-eclust(data, "clara", k=3, graph=FALSE) 
fviz_cluster(cl1, palette=c("#F4F50C", "#A2A204", "#7B7B60"), ellipse.type="t", geom="point", pointsize=1, ggtheme=theme_classic(), main="Clara, clusters: 3")

fviz_silhouette(cl1, palette=c("#F4F50C", "#A2A204", "#7B7B60"), main="Clara silhouette, clusters: 3")

##   cluster size ave.sil.width
## 1       1  549          0.34
## 2       2  581          0.36
## 3       3  242          0.44

cl2<-eclust(data, "clara", k=4, graph=FALSE) 
fviz_cluster(cl2, palette=c("#9E1EC1", "#715879", "#4C0D60", "#F60FE9"), ellipse.type="t", geom="point", pointsize=1, ggtheme=theme_classic(), main="Clara, clusters: 4")

fviz_silhouette(cl2, palette=c("#9E1EC1", "#715879", "#4C0D60", "#F60FE9"), main="Clara silhouette, clusters: 4")

##   cluster size ave.sil.width
## 1       1  454          0.28
## 2       2  511          0.31
## 3       3  242          0.42
## 4       4  165          0.38

cl3<-eclust(data, "clara", k=5, graph=FALSE) 
fviz_cluster(cl3, palette=c("#0FF4F6", "#178182", "#26B0F0", "#1058D3", "#093682"), ellipse.type="t", geom="point", pointsize=1, ggtheme=theme_classic(), main="Clara, clusters: 5")

fviz_silhouette(cl3, palette=c("#0FF4F6", "#178182", "#26B0F0", "#1058D3", "#093682"), main="Clara silhouette, clusters: 5")

##   cluster size ave.sil.width
## 1       1  332          0.29
## 2       2  256          0.32
## 3       3  247          0.33
## 4       4  208          0.36
## 5       5  329          0.26

3.5 Cluster statistics

When comparing the different clusterings, aside from focusing on visual differences, it is also advisable to examine the statistics of all the variables within all the clusters of each chosen version. The main statistics for a specific clustering can be calculated manually in a loop, which makes it easy to add more functions if needed. Example calculations of the mean, standard deviation, minimum and maximum for the K-means clustering with k=3 are shown below.

#for 3 clusters and 4 variables - using k-means
# for each variable by groups, for many statistics
km1<-eclust(data, "kmeans", hc_metric="euclidean",k=3, graph=FALSE)
xxc<-as.data.frame(cbind(data, km1$cluster))
stats<-matrix(0, nrow=12, ncol=4)
colnames(stats)<-c("mean","sd","min", "max")
rownames(stats)<-rep(c("cluster1","cluster2", "cluster3"),times=4)
rownames(stats)<-paste(rownames(stats), rep(c("variance","skewness","curtosis","entropy"), each=3))
funs<-c("mean","sd","min", "max")

for(i in 1:4){ # iterating by variables
  for(j in 1:4){ # iterating by functions
    temp<-aggregate(xxc[,i], by=list(xxc[,5]), funs[j])
    stats[(3*i-2):(3*i),j]<-temp$x}}
stats

##                         mean        sd      min      max
## cluster1 variance  0.7443709 2.9704246  -7.0421  6.82480
## cluster2 variance  0.9869375 2.5581238  -4.9462  6.09190
## cluster3 variance -1.9193186 1.9205531  -5.2943  3.15410
## cluster1 skewness  7.3275653 2.5974515   2.5349 12.95160
## cluster2 skewness  0.0997388 2.3398880  -5.4236  4.92280
## cluster3 skewness -7.7398343 2.6435377 -13.7731 -1.57890
## cluster1 curtosis -1.5849619 2.0729590  -5.2861  3.32850
## cluster2 curtosis  1.5727344 2.3835350  -3.3668  8.82940
## cluster3 curtosis  8.9987606 3.2468647   3.8298 17.92740
## cluster1 entropy  -2.8769837 2.0520860  -8.5482  0.86937
## cluster2 entropy   0.2278310 0.8558745  -2.5095  2.44950
## cluster3 entropy  -0.5048393 1.1670370  -3.3202  2.13530
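
The same kind of table can also be produced without explicit loops by letting `aggregate()` handle all variables at once. The sketch below is an alternative formulation on the built-in `iris` measurements rather than the banknote data; the helper name `by_cluster` is an assumption of the example.

```r
# Illustration on the built-in iris measurements (not the banknote data)
set.seed(10)
x  <- iris[, 1:4]
km <- kmeans(x, centers = 3, nstart = 10)

# One aggregate() call per statistic covers all variables and clusters at once
by_cluster <- function(fun) aggregate(x, by = list(cluster = km$cluster), FUN = fun)

stats_list <- lapply(c(mean = mean, sd = sd, min = min, max = max), by_cluster)
stats_list$mean   # per-cluster means, which match km$centers
```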

However, if only the most popular descriptive statistics are needed, it is also possible to use the dedicated function “describeBy”. The example below shows the results for 3 clusters as well, this time obtained with the PAM method, and makes it possible to see that the values differ slightly between the two grouping techniques.

library(psych)

#for 3 clusters and 4 variables - using PAM
pam1<-pam(data, 3)
xxc<-as.data.frame(cbind(data, pam1$cluster))
describeBy(xxc[,1:4], xxc[,5])

## 
##  Descriptive statistics by group 
## group: 1
##          vars   n  mean   sd median trimmed  mad   min   max range  skew
## variance    1 512  1.13 2.96   1.34    1.42 2.61 -7.04  6.82 13.87 -0.80
## skewness    2 512  7.84 2.29   7.90    7.90 2.51  3.18 12.95  9.77 -0.16
## curtosis    3 512 -1.57 2.16  -1.65   -1.65 2.64 -5.29  3.33  8.61  0.20
## entropy     4 512 -2.88 2.18  -2.87   -2.78 2.40 -8.55  1.08  9.63 -0.32
##          kurtosis   se
## variance     0.32 0.13
## skewness    -0.73 0.10
## curtosis    -1.00 0.10
## entropy     -0.74 0.10
## ------------------------------------------------------------ 
## group: 2
##          vars   n  mean   sd median trimmed  mad   min  max range  skew
## variance    1 618  0.64 2.64   0.38    0.66 3.32 -4.95 6.09 11.04  0.04
## skewness    2 618  0.58 2.32   0.73    0.75 2.48 -5.19 4.92 10.11 -0.54
## curtosis    3 618  1.04 2.32   0.88    0.98 2.21 -3.65 7.76 11.41  0.28
## entropy     4 618 -0.08 1.25   0.16    0.06 1.02 -4.18 2.45  6.63 -1.02
##          kurtosis   se
## variance    -1.15 0.11
## skewness    -0.53 0.09
## curtosis    -0.08 0.09
## entropy      0.91 0.05
## ------------------------------------------------------------ 
## group: 3
##          vars   n  mean   sd median trimmed  mad    min   max range  skew
## variance    1 242 -1.58 2.10  -1.84   -1.74 1.86  -5.29  4.34  9.63  0.70
## skewness    2 242 -7.17 3.00  -6.76   -7.06 2.41 -13.77  1.07 14.84 -0.27
## curtosis    3 242  8.61 3.27   8.19    8.24 3.22   3.44 17.93 14.49  0.94
## entropy     4 242 -0.45 1.14  -0.37   -0.38 1.10  -3.32  2.14  5.46 -0.47
##          kurtosis   se
## variance    -0.16 0.14
## skewness     0.21 0.19
## curtosis     0.50 0.21
## entropy     -0.28 0.07

The characteristics of the various clustering results can also be portrayed with stripe plots (for k-means), which show the distance of the data points to the cluster centroids.

d1<-cclust(data, 3, dist="euclidean")
d2<-cclust(data, 4, dist="euclidean")
d3<-cclust(data, 5, dist="euclidean")

#stripe plots for k-means clustering with k = 3, 4, 5
stripes(d1)

stripes(d2)

stripes(d3)

Further visualization options include boxplots of the variables by group,

km1<-kmeans(data, 3)
groupBWplot(data, km1$cluster, alpha=0.05)

km2<-kmeans(data, 4)
groupBWplot(data, km2$cluster, alpha=0.05)

km3<-kmeans(data, 5)
groupBWplot(data, km3$cluster, alpha=0.05)

or even dotplots in groups according to different variables.

#for 3 clusters
km1<-kmeans(data, 3)
xxc<-as.data.frame(cbind(data, km1$cluster))
# in a formula y ~ x the left-hand side is drawn on the y axis
xyplot(xxc[,1] ~ xxc[,2] | km1$cluster, data=xxc, xlab="skewness", ylab="variance")

xyplot(xxc[,3] ~ xxc[,4] | km1$cluster, data=xxc, xlab="entropy", ylab="curtosis")

4. Optimal clustering

Even after focusing on statistical values and visual representations of both the cluster spreads and the silhouettes, deciding unambiguously on the best number of clusters for the dataset is very difficult and requires further exploration. First, however, it is worth checking whether the data can be clustered in a statistically significant way at all, in order to ensure the validity of the clustering quality tests performed later in the project. The most commonly used measure of the clustering tendency of a dataset is the Hopkins statistic. The null hypothesis states that the observations are uniformly distributed and no statistically significant clusters can be found. The value of the statistic ranges between 0 and 1; in the implementation used here, values around 0.5 are what uniformly random data would produce, values close to 1 indicate a strong clustering tendency, and the lower the value the worse the clustering tendency. For the banknote features the obtained value is about 0.9999 (the exact figure varies slightly between runs because the statistic relies on random sampling), which very strongly suggests rejecting the null hypothesis - the analyzed data can indeed be grouped in a significant way and the clustering analysis is justified.

hopkins::hopkins(data, m=nrow(data)/10)

## [1] 0.9999776
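
To show where such a value comes from, the sketch below implements a simplified Hopkins-type statistic in base R on simulated data. It compares nearest-neighbour distances of uniformly drawn artificial points (`u`) with those of sampled real observations (`w`) through H = sum(u^d) / (sum(u^d) + sum(w^d)), where d is the number of variables. The function name `hopkins_sketch` is hypothetical and this is not the code of the hopkins package, so its values need not match that package exactly.

```r
# Simplified Hopkins-type statistic; a sketch, not the hopkins package code
hopkins_sketch <- function(x, m = 10) {
  x <- as.matrix(x)
  n <- nrow(x); d <- ncol(x)
  idx <- sample(n, m)  # m real observations, sampled without replacement
  # m artificial points drawn uniformly inside the bounding box of the data
  u_pts <- sapply(seq_len(d), function(j) runif(m, min(x[, j]), max(x[, j])))
  # Euclidean distance from point p to its nearest row of pts
  nearest <- function(p, pts) min(sqrt(rowSums(sweep(pts, 2, p)^2)))
  u <- apply(u_pts, 1, nearest, pts = x)
  w <- sapply(idx, function(i) nearest(x[i, ], x[-i, , drop = FALSE]))
  sum(u^d) / (sum(u^d) + sum(w^d))
}

# Two tight, well-separated simulated groups should give a value close to 1
set.seed(1)
clustered <- rbind(matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),
                   matrix(rnorm(100, mean = 5, sd = 0.1), ncol = 2))
h_clustered <- hopkins_sketch(clustered, m = 10)
h_clustered
```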

4.1 Optimal number of clusters

Deciding on an optimal number of clusters is usually not entirely straightforward and requires checking several methods that assess the quality of clusters according to their number. The most popular options include the previously mentioned silhouette statistic, as well as the WSS (Within-Cluster Sum of Squares) measure and the gap statistic. The silhouettes, displayed earlier for various clustering versions, not only help to assess whether the observations are grouped properly; the average silhouette width for each k can also help to decide on the optimal value of k. For PAM clustering the plot indicates the highest average silhouette width for 2 clusters, however k=3 is not much lower.

#method = PAM
fviz_nbclust(data, FUNcluster=cluster::pam, method = "silhouette")
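
The numbers behind such a plot can be obtained directly from the fitted PAM objects. The sketch below does this on simulated data with three well-separated groups, so the values are not those of the banknote data.

```r
library(cluster)  # pam()

# Simulated data with three well-separated groups (not the banknote data)
set.seed(7)
x <- rbind(matrix(rnorm(100, mean = 0,  sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 5,  sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2))

ks <- 2:6
# pam() stores the average silhouette width of a fit in $silinfo$avg.width
avg_sil <- sapply(ks, function(k) pam(x, k)$silinfo$avg.width)
names(avg_sil) <- ks

round(avg_sil, 3)
ks[which.max(avg_sil)]   # k with the largest average silhouette width
```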

On the other hand, the WSS measure assesses the total variance within the clusters in order to describe their homogeneity, with low values indicating compact clusters. Because WSS generally keeps falling as k grows, the usual approach is to look for the point after which adding clusters brings little further reduction. For k values from 3 to 7 and from 8 to 10 the obtained values seem to be very similar and quite low within each of these ranges. Choosing only 2 clusters results in a visibly higher WSS value, which suggests that k = 3 might be beneficial.

fviz_nbclust(data, FUNcluster=cluster::pam, method = "wss")
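
For k-means the WSS curve can be reproduced with base R alone, since `kmeans()` reports the total within-cluster sum of squares. The sketch below uses simulated data with three groups and is only meant to show how the elbow is read.

```r
# Simulated data with three well-separated groups (not the banknote data)
set.seed(7)
x <- rbind(matrix(rnorm(100, mean = 0,  sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 5,  sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2))

ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
names(wss) <- ks

round(wss, 1)
# relative drop in WSS when moving from k - 1 to k clusters;
# the elbow is where this drop becomes small
round(-diff(wss) / head(wss, -1), 3)
```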

Finally, the gap statistic compares the within-cluster dispersion of the formed clusters with the dispersion expected under a reference uniform distribution. A larger gap value means that the observed clustering differs more strongly from the random (uniform) case, so an increase in the statistic as k grows means that adding clusters improves the clustering quality. For the Clara clustering outcomes visible below, going from 2 to 3 clusters results in a significant improvement in quality. After k=3 the function still rises, but much more slowly, and then locally peaks at k=5. The last significant improvement seems to happen when changing from 7 to 8 clusters, however those numbers might be too high for the analyzed data. According to those results, it might be beneficial to consider 3 or 5 as the optimal number of clusters.

fviz_nbclust(data, FUNcluster=cluster::clara, method="gap_stat")+ theme_classic()
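
The gap values themselves can be inspected with `cluster::clusGap()`. The sketch below runs it with k-means on simulated data; the number of reference datasets `B = 50` and the selection rule `"firstSEmax"` are arbitrary choices for the example, and other rules may pick a different k.

```r
library(cluster)  # clusGap(), maxSE()

# Simulated data with three well-separated groups (not the banknote data)
set.seed(7)
x <- rbind(matrix(rnorm(100, mean = 0,  sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 5,  sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2))

# B reference datasets are drawn uniformly over the range of x;
# nstart is passed on to kmeans()
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 50, nstart = 10, verbose = FALSE)

round(gap$Tab, 3)   # columns: logW, E.logW, gap, SE.sim
# One common selection rule; maxSE() offers several other 'method' values
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
best_k
```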

The mentioned techniques give some level of insight into the quality of different numbers of clusters for the considered variance, skewness, kurtosis and entropy, but the decision on the value of k is still not unambiguous. Therefore, more measures might need to be considered, and some interesting ones can be found in the “ClusterR” package. The following plots offer additional detail on the variance explained, WCSSE (within-cluster-sum-of-squared-error), adjusted R squared and AIC for k-means clustering.

#additional measures - variance explained, WCSSE (within-cluster-sum-of-squared-error), adjusted R squared, AIC
opt<-Optimal_Clusters_KMeans(data, max_clusters=10, plot_clusters = TRUE)

opt<-Optimal_Clusters_KMeans(data, max_clusters=10, plot_clusters = TRUE, criterion = "WCSSE")

opt<-Optimal_Clusters_KMeans(data, max_clusters=10, plot_clusters=TRUE, criterion="Adjusted_Rsquared")

opt<-Optimal_Clusters_KMeans(data, max_clusters=10, plot_clusters=TRUE, criterion="AIC")

Moreover, the analysis can also consider the dissimilarity measure for medoid-based clustering, which in this case has been inspected for the potentially optimal k values of 2 and 3.

opt_md<-Optimal_Clusters_Medoids(data, 10, 'euclidean', plot_clusters=TRUE)

Apart from various statistical measures, a few R packages also offer automatic algorithms that might be worth considering when the results are inconclusive. The package “NbClust” considered values of k ranging from 2 to 15 and, using complete-linkage hierarchical clustering and the Calinski-Harabasz index, selected 3 as the optimal number of clusters.

#automatic selection using NbClust
library(NbClust)
c3<-NbClust(data, distance="euclidean", method="complete", index="ch")
c3$All.index

##         2         3         4         5         6         7         8         9 
##  710.6478 1171.0569 1140.1218 1092.2639  953.3701  917.4463  909.8658  864.9369 
##        10        11        12        13        14        15 
##  876.2699  828.6050  872.0372  885.2974  954.0464  930.1902

#best k value
c3$Best.nc

## Number_clusters     Value_Index 
##           3.000        1171.057

However, the function “pamk” from the package “fpc”, which applies the PAM clustering method and selects k by the average silhouette width, determined 2 to be the best number of clusters.

#automatic selection for PAM using fpc::pamk
pamk.best<-pamk(data, krange=2:10,criterion="asw", usepam=TRUE, scaling=FALSE, alpha=0.001, diss=inherits(data, "dist"), critout=FALSE)
pamk.best$crit

##  [1] 0.0000000 0.4004089 0.3651353 0.3389915 0.2999610 0.3159604 0.3057440
##  [8] 0.3386732 0.3470717 0.3672559

#best k value
pamk.best$nc

## [1] 2

4.2 Clustering quality

A characteristic feature of clustering methods is that different numbers of clusters can bring different benefits to the analysis, and even when many measures or automatic selection algorithms are considered, their results are not always consistent. To add some final insight into the choice of k, this section focuses on further statistical indexes and tests, which assess and compare the quality of the various clustering options.

Firstly, to better understand which of the variables drive the cluster assignments, it might be beneficial to explore the feature importance in the analysis. With the help of the function “FeatureImpCluster”, it is possible to conclude that skewness is the most relevant variable in terms of its impact on cluster assignments.

library(FeatureImpCluster)
km<-kcca(data, k=3)
FeatureImp_km<-FeatureImpCluster(km, as.data.table(data))

plot(FeatureImp_km)

Additionally, it might be valuable to examine whether the observations are clustered in a similar way when only 2 variables at a time are considered. As a similarity measure, the Rand Index determines how closely two different clusterings resemble each other, which helps to spot any significant differences. As far as the flexclust documentation indicates, randIndex reports the chance-corrected (adjusted) version of the index by default. The Rand Index calculated for all pairs of 2-variable configurations seems to indicate a similar clustering pattern across the various configurations.

#clustering according to only two selected variables:
#variance (v), skewness (s), curtosis (c), entropy (e)
set.per1<-data[,1:2] #v_s
set.per2<-data[,2:3] #s_c
set.per3<-data[,3:4] #c_e
set.per4<-data[,c(1,4)] #v_e
set.per5<-data[,c(1,3)] #v_c
set.per6<-data[,c(2,4)] #s_e

v_s<-cclust(set.per1, 4, dist="euclidean")
s_c<-cclust(set.per2, 4, dist="euclidean") 
c_e<-cclust(set.per3, 4, dist="euclidean") 
v_e<-cclust(set.per4, 4, dist="euclidean") 
v_c<-cclust(set.per5, 4, dist="euclidean") 
s_e<-cclust(set.per6, 4, dist="euclidean") 

par(mar=c(6,6,6,6))
library(RColorBrewer)
library(unikn)
p6<-brewer.pal(n=9, name="YlGn")
mix6<-usecol(pal=c("white", p6))

vec.coef<-c("v_s", "s_c", "c_e", "v_e", "v_c", "s_e")
tab.ri<-matrix(0, nrow=6, ncol=6)
colnames(tab.ri)<-c("v_s", "s_c", "c_e", "v_e", "v_c", "s_e")
rownames(tab.ri)<-c("v_s", "s_c", "c_e", "v_e", "v_c", "s_e")

for(i in 1:6){
  for(j in 1:6){
    rix<-randIndex(get(vec.coef[i]), get(vec.coef[j]))
    tab.ri[i,j]<-rix}}
diag(tab.ri)<-NA

library(plot.matrix)
plot(tab.ri, col=mix6, main="Rand Index for clusters according to 2 variables", xlab=" ", ylab=" ")
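
For reference, the plain (uncorrected) Rand Index is simply the share of observation pairs that two partitions treat in the same way, either both together or both apart. The sketch below computes it in base R for two made-up label vectors; the function name `rand_index` is an assumption of the example, and the value differs from a chance-corrected index.

```r
# Two made-up partitions of the same four observations
a <- c(1, 1, 2, 2)
b <- c(1, 1, 1, 2)

rand_index <- function(u, v) {
  pairs  <- combn(length(u), 2)                # all pairs of observations
  same_u <- u[pairs[1, ]] == u[pairs[2, ]]     # pair together in partition u?
  same_v <- v[pairs[1, ]] == v[pairs[2, ]]     # pair together in partition v?
  mean(same_u == same_v)                       # share of pairs treated alike
}

rand_index(a, b)   # 3 of the 6 pairs agree, so the result is 0.5
```

The index depends only on which observations are grouped together, so relabelling the clusters of one partition leaves it unchanged.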

Considering all of the analysis conducted up to this point, the optimal number of clusters appears to be 2 or 3, according to the silhouette, WSS and gap statistic measures, as well as the automatic selection algorithms. To decide between the two, some important statistics are reviewed below (computed on a complete-linkage hierarchical partition cut into 2 and 3 clusters), followed by 2 more statistical tests: Calinski-Harabasz and Duda-Hart.

#k=2: cluster sizes, average distances, average silhouette width
d<-dist(data)
complete2<-cutree(hclust(d),2)
c.stat2<-cluster.stats(d,complete2)
c.stat2$cluster.size

## [1] 1247  125

c.stat2$average.distance

## [1] 8.653993 5.158220

c.stat2$avg.silwidth

## [1] 0.4633327

#k=3: cluster sizes, average distances, average silhouette width
d<-dist(data)
complete3<-cutree(hclust(d),3)
c.stat3<-cluster.stats(d,complete3)
c.stat3$cluster.size

## [1] 520 727 125

c.stat3$average.distance

## [1] 6.285127 6.588514 5.158220

c.stat3$avg.silwidth

## [1] 0.3497487

The Calinski-Harabasz index scores a clustering result by the ratio of the between-cluster dispersion to the within-cluster dispersion, with higher values indicating better separated and more compact clusters. The results for the K-means method show the highest score (1423.67) for 2 clusters, closely followed by k=3 (1409.26). However, the outcomes for both PAM and Clara indicate that k=3 produces the highest quality of clustering.
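
For k-means the index can be reconstructed from quantities that `kmeans()` already returns, which makes the ratio explicit. The sketch below uses the built-in `iris` measurements rather than the banknote data, and the function name `ch_index` is an assumption of the example.

```r
# Illustration on the built-in iris measurements (not the banknote data)
set.seed(3)
x <- as.matrix(iris[, 1:4])

ch_index <- function(x, k) {
  n  <- nrow(x)
  km <- kmeans(x, centers = k, nstart = 10)
  # between-cluster dispersion per degree of freedom divided by
  # within-cluster dispersion per degree of freedom
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

ch <- sapply(2:5, function(k) ch_index(x, k))
names(ch) <- 2:5
round(ch, 2)
```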

#for k-means
km1<-kmeans(data, 2) # stats::
round(calinhara(data, km1$cluster),digits=2)

## [1] 1423.67

km2<-kmeans(data, 3) # stats::
round(calinhara(data, km2$cluster),digits=2)

## [1] 1409.26

km2<-kmeans(data, 4) # stats::
round(calinhara(data, km2$cluster),digits=2)

## [1] 1201.12

km2<-kmeans(data, 5) # stats::
round(calinhara(data, km2$cluster),digits=2)

## [1] 1109.65

#for pam 
pm1<-pam(data, 2)
round(calinhara(data, pm1$cluster),digits=2)

## [1] 1332.53

pm2<-pam(data, 3)
round(calinhara(data, pm2$cluster),digits=2)

## [1] 1366.44

pm3<-pam(data, 4)
round(calinhara(data, pm3$cluster),digits=2)

## [1] 1155.1

pm4<-pam(data, 5)
round(calinhara(data, pm4$cluster),digits=2)

## [1] 1085.85

#for clara 
cl1<-eclust(data, "clara", k=2, graph=FALSE) 
round(calinhara(data, cl1$clustering),digits=2)

## [1] 1257.63

cl2<-eclust(data, "clara", k=3, graph=FALSE) 
round(calinhara(data, cl2$clustering),digits=2)

## [1] 1387.88

cl3<-eclust(data, "clara", k=4, graph=FALSE) 
round(calinhara(data, cl3$clustering),digits=2)

## [1] 1158.63

cl4<-eclust(data, "clara", k=5, graph=FALSE) 
round(calinhara(data, cl4$clustering),digits=2)

## [1] 1067.68

For the final evaluation, the Duda-Hart test is applied. It checks whether a dataset is homogeneous or should be split into two clusters. The null hypothesis states that the data is homogeneous (a single cluster), and the function returns $cluster1=TRUE when this hypothesis cannot be rejected. For the 2-cluster partitions produced by all 3 clustering methods the result is $cluster1=FALSE, so the null hypothesis is rejected and heterogeneity is accepted. According to the test, in the case of K-means, PAM and Clara the data should therefore be split into at least 2 clusters. The test does not by itself decide between 2 and 3 clusters, so that choice rests on the measures discussed above.

#duda-hart test - in every case the data should be split rather than kept as one cluster
km1<-kmeans(data,2) 
dh1<-dudahart2(data, km1$cluster)
dh1$cluster1

## [1] FALSE

pm1<-pam(data, 2)
dh2<-dudahart2(data, pm1$cluster)
dh2$cluster1

## [1] FALSE

cl1<-eclust(data, "clara", k=2, graph=FALSE) 
dh3<-dudahart2(data, cl1$clustering)
dh3$cluster1

## [1] FALSE

The conducted analysis included many methods of assessing and comparing the quality of different clustering specifications. The results differed across the techniques and also across the clustering methods themselves, however considering all of the displayed outcomes, it seems that the optimal number of clusters might be fixed at 3. Clustering solutions with the predetermined value k=3 will now be used to showcase simple predictive possibilities for future data.

5. Predictions

After well-fitted and efficient clustering methods have been applied, it is also possible to use predictive functions to quickly assess potential future data. In order to present the predicted cluster assignments and their quality, the first step is to divide the dataset into a training and a testing subset. According to common practice the test part will contain about 10% of the observations (here the last 130 rows).

#divide dataset into two parts: training and test (last 10% of observations)
set.full<-data
set.train<-set.full[c(1:1242), ] 
set.test<-set.full[c(1243:1372), ]
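
If the rows of a dataset happen to be ordered (for instance by class), taking the last 10% may give an unrepresentative test set. A random split is a common alternative; the sketch below shows one on a stand-in data frame, where `toy` and its columns are hypothetical and only the number of rows mirrors the banknote data.

```r
# Stand-in data frame with the same number of rows as the banknote data
set.seed(2025)
toy <- data.frame(id = 1:1372, value = rnorm(1372))

test_idx  <- sample(nrow(toy), size = round(0.1 * nrow(toy)))  # 10% of the rows
toy_test  <- toy[test_idx, ]
toy_train <- toy[-test_idx, ]

c(train = nrow(toy_train), test = nrow(toy_test))
```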

In the next steps, with the help of the package “flexclust”, both the train and the test subsets will be clustered again, and then the clustering of the train data will be used to predict the clusters of the test observations.

library(flexclust)
km.train<-eclust(set.train, "kmeans", hc_metric="euclidean",k=3, graph = FALSE)
km.train.kcca<-as.kcca(km.train, set.train) # conversion to kcca
km.pred<-predict(km.train.kcca, set.test) # prediction for k-means
km.test<-eclust(set.test, "kmeans", hc_metric="euclidean",k=3, graph = FALSE)

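Lastly, in order to check the quality of the predictions, the predicted clusters are compared with the clusters obtained by clustering the test set on its own, and the mismatches are summed. The run reported below counted 1 mismatch for the K-means clustering. Because cluster labels from two independent fits are arbitrary, such a count is only meaningful once the labels have been matched, so the figure should be treated as indicative.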

#errors: number of test observations whose predicted cluster differs
#from the cluster obtained by clustering the test set on its own
err.kmeans<-sum(km.pred!=km.test$cluster)
err.kmeans

## [1] 1

Additionally, the same quick analysis can be conducted for both PAM and Clara clustering to check if the outcomes will differ.

pam.train<-eclust(set.train, "pam", hc_metric="euclidean",k=3, graph = FALSE)
pam.train.kcca<-as.kcca(pam.train, set.train) # conversion to kcca
pam.pred<-predict(pam.train.kcca, set.test) # prediction for PAM
pam.test<-eclust(set.test, "pam", hc_metric="euclidean",k=3, graph = FALSE)
clara.train<-eclust(set.train, "clara", hc_metric="euclidean",k=3, graph = FALSE)
clara.train.kcca<-as.kcca(clara.train, set.train) # conversion to kcca
clara.pred<-predict(clara.train.kcca, set.test) # prediction for Clara
clara.test<-eclust(set.test, "clara", hc_metric="euclidean",k=3, graph = FALSE)

#the same mismatch counts for PAM and Clara
err.pam<-sum(pam.pred!=pam.test$clustering)
err.clara<-sum(clara.pred!=clara.test$clustering)

#errors for PAM and Clara
err.pam

## [1] 1

err.clara

## [1] 1

The reported run also shows 1 mismatching cluster value for both the PAM and the Clara predictions on the test data. Since cluster labels from separate fits are arbitrary, the counts are best treated as indicative, but they suggest a good quality of predictions. The approach, while simple, may serve as a useful tool for adding new data to an already defined clustering.
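
Cluster numbers produced by two independent fits need not correspond to each other, so a direct comparison of labels can misstate the number of mismatches. One simple way to align the labels first is to map every predicted cluster to the reference cluster it overlaps with most. The sketch below does this for two made-up label vectors; `pred` and `ref` are hypothetical and stand in for predicted and reference assignments.

```r
# Made-up predicted and reference labels with permuted cluster numbers
pred <- c(1, 1, 2, 2, 3, 3, 3)
ref  <- c(2, 2, 3, 3, 1, 1, 2)

# Contingency table of predicted versus reference labels
tab <- table(pred, ref)

# Map each predicted cluster to the reference cluster it overlaps with most
mapping <- apply(tab, 1, function(r) colnames(tab)[which.max(r)])
aligned <- as.integer(mapping[as.character(pred)])

c(raw_mismatches = sum(pred != ref), aligned_mismatches = sum(aligned != ref))
# raw_mismatches = 7, aligned_mismatches = 1
```

This majority mapping is a heuristic: if two predicted clusters map to the same reference cluster, a one-to-one matching method would be more appropriate.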

6. Conclusion

The conducted analysis aimed to showcase effective clustering solutions using the example of banknote authentication data. The variables describing the variance, skewness, kurtosis and entropy of banknote pictures were used to find the most efficient clustering options, which could make potential fraud easier to detect and could improve future analysis by allowing different algorithms to be applied to different clusters of observations. After the K-means, PAM and Clara clustering methods had been demonstrated, a series of quality measures was applied in order to decide on the optimal number of clusters for the considered dataset. The selected techniques gave varying results, however in the end the number of clusters was set to 3. The conducted analysis not only shows a process of effective clustering that can be applied to other datasets, but also gives insight into potential uses of the clustering methods through the example of banknote data.

Clustering project

Weronika Wyrwas

2025-02-06