library(bookdown)

1 Introduction

In the era of globalization, the topic of division between different regions is still being discussed and countries in terms of many aspects. Globalization is the process of “widening and deepening the interdependence between countries and regions as a result of increasing international flows and the activities of transnational corporations“ (Fiedor, Matysiak 2005). Globalization can therefore be understood as a special kind of interdependence between countries or their groups. The aim of this topic will be to select groups from among 45 countries in Europe and Central Asia, thanks to which it will be possible to assess which countries are similar to each other in terms of the factors selected.

There is a long tradition of research related to the agglomeration of economic activity at the geographical plane, starting with Marshall (1890). The tradition was recently restored by New Economic Geography, a study of the location, distribution and spatial organization of economic activities around the world.

Currently, clusters are not only the realities of European economies, but also they are increasingly becoming a policy leverage in different geographic areas. The interest in clusters is growing because, in addition to describing the economic realities, they are also the subject of action. In recent years, they have become the focal point of many new political initiatives, both in Europe and around the world.

The problem of finding differences and similarities between countries has been discussed in many studies. An example can be the analysis of Mircea Gligor and Marcel Ausloos on convergence and clusters in the European Union based on macroeconomic indicators, work by Rosina Moreno, Raffaele Paci, Stefano Usai on innovation clusters in regions of Europe, or the study by Christine M. Aumayr entitled European Region Types in EU-25.

Following the literature, it was decided to use cluster analysis methods in this work, considering them the most appropriate from the point of view of the research problem. “Cluster analysis is a group of methods for creating a meaningful and interpretable classification of an initially unclassified dataset using the values of observable variables at the level of each individual object (Everitt, 1998).” In this article will be performed a cluster analysis, dividing the countries of Europe and Central Asia into segments. Then it was tried to assess which method turned out to be better in the analyzed case.

2 Research hypotheses

It was assumed that during the grouping, a cluster will emerge covering the countries created after the collapse of the USSR - characterized by lower values of indicators describing welfare and technological development, but indicating a significant development potential.

Another group may be the countries of the “old” EU with a relatively stable economy, with the highest welfare and quality of life, with negative or very low population growth, with advanced technological development, with a modern structure of the economy.

It was also assumed that the countries of the “new” Union would constitute a separate group, in which the indicators included in the study had intermediate values compared to the two previous groups.

It is possible that a single cluster will be created by Great Britain, remaining “on the sidelines” of EU.

3 Database and description of variables

The article is based on data from the World Bank’s WDI (World Development Indicators) database (http://ddp-ext.worldbank.org/ext/DDPQQ/member.do?method=getMembers&userid=1&queryId=135) covering the year 2019.

library(readxl) 
library(stats)

setwd("/Users/nehrebeckiwp.pl/Desktop/UL_cluster")
data1 <- read_excel("dane_p.xlsx")
data = data1[,-1]

Data concern the countries of Europe and Central Asia. Due to shortcomings in the database, the study covers 45 countries from group 69. It includes the following variables (all of them are distributed continuously):

Foreign direct investmentnet in flow - net proceeds from foreign direct investment expressed in US dollars. The distribution of the variable is leptokurtic and characterized by right-hand side asymmetry.

summary(data$BoP)

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -1.099e+10  3.537e+08  1.856e+09  7.277e+09  4.521e+09  8.010e+10

High-technology exports of manufacture - share of high-tech goods in exported goods expressed as a percentage. The distribution of the variable is leptokurtic and characterized by right-hand side asymmetric.

summary(data$HighTech)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.097   4.675   8.057  11.572  17.305  39.987

Mobile cellular subscriptions per 10 - number of mobile phone owners per 100 people. The variable has a left-asymmetric distribution, the values of the feature are less concentrated than in the normal distribution.

summary(data$Mobile)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.051  40.649  79.428  65.063  94.876 107.864

CO2 emissions metrictons per capita - carbon dioxide emission value in tonnes per person. The distribution of the variable is right-side asymmetric and leptokurtic.

summary(data$CO2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.750   4.565   6.722   7.002   9.246  24.147

Electric power consumption kwh per ca - value of electricity consumption per person in kWh. The variable is characterized by a right-asymmetric distribution, the values of the feature are more concentrated than in the normal distribution.

summary(data$Electric)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1224    2551    4526    6092    6831   28213

Inflation gdp deflator annual - annual inflation rate. The distribution of the variable is right-hand side asymmetric and leptokurtic.

summary(data$Inflation)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2296  2.4028  4.0887  6.1979  7.9765 22.6751

GDP growth annual - annual GDP growth rate. The variable is characterized by a right-asymmetric, leptokurtic distribution.

summary(data$GDP)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.059   3.728   5.156   6.002   7.700  17.200

GNP per capita ppp current internation - gross national income, expressed in US dollars. The distribution of the variable is right-hand side asymmetric and platokurtic.

summary(data$GNI)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1310    7240   14050   18067   30110   56430

Imports of goods and services of GDP - share of imports of goods and services in GDP. The variable is characterized by a right asymmetry, leptokurtic distribution.

summary(data$Imports)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.14   36.23   46.71   51.47   66.53  126.47

Industry value added of GDP - industry share in GDP. The distribution of the variable is right-hand side asymmetric and leptokurtic.

summary(data$Industry)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.89   24.95   29.16   29.92   34.95   54.72

Population growth annual - annual population growth rate. The distribution of the variable is characterized by left-hand side asymmetry and leptokurticism.

summary(data$Population)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.7731 -0.2214  0.3459  0.3037  0.6985  1.8043

Exports of goods and services of GDP - share of exports in GDP. The variable has a right asymmetry, leptokurtic distribution.

summary(data$Exports)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.54   34.12   43.96   48.27   58.31  148.51

As can be seen from the basic descriptive statistics presented above, the distributions of variables differ significantly from the normal distribution. In addition, many variables are characterized by large values of standard deviations in relation to the mean, testifying with significant diversification of the values of variables. So the values of variables are on different scales the variables will be normalized using scale() function in order to get proper and interpretable results.

data <- scale(data)

It is useful to analyse the relationship between variables in our dataset.

library(corrplot)

## corrplot 0.84 loaded

data_matrix <- data.matrix(data, rownames.force = NA)
Matrix <- cor(data_matrix)
corrplot(Matrix, method = "number", number.cex = 0.75, order="hclust")

Based on the above mentioned matrix highly correlated between variables can be observed. In other to check if the dataset contains outliers, it is useful to calculate interquartile range statistic.

vars <- c(colnames(data))
Outliers <- c()
for(i in vars){
  max <- quantile(data[,i], 0.75) + (IQR(data[,i]) * 1.5 )
  min <- quantile(data[,i], 0.25) - (IQR(data[,i]) * 1.5 )
  idx <- which(data[,i] < min | data[,i] > max)
  print(paste(i, length(idx), sep=' ')) # verification of potential outliers
  Outliers <- c(Outliers, idx) 
}

## [1] "BoP 12"
## [1] "HighTech 1"
## [1] "Internet 0"
## [1] "Mobile 0"
## [1] "CO2 1"
## [1] "Electric 5"
## [1] "Inflation 3"
## [1] "GDP 1"
## [1] "GNI 0"
## [1] "Imports 1"
## [1] "Industry 1"
## [1] "Service 1"
## [1] "Population 1"
## [1] "Exports 1"

Based on the results some potential outliers points can be observed. So, to decide whether the above observations should be discarded from the survey data.

Plots of the variables with covered outlies

par(mfrow=c(2,2))
colnames <- colnames(data[,c(1:2,6:7)])
for (i in colnames) {
  plot(data[,i], main = paste("Plot of ", i), ylab = i)
}

Plots of the variables with covered outlies

On the basis of the presented charts, outliers were analyzed and it was found that countries with unusual values were not rejected.

4 Cluster analysis

In the further part of the work, a cluster analysis will be conducted. The aim of this work is to divide the countries of Europe and Central Asia into segments.

4.1 Prediagnostics

Before clustering it is useful to do prediagnostics.

library(clustertend)
library(factoextra)

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

get_clust_tendency(data, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"))

## $hopkins_stat
## [1] 0.6806388
## 
## $plot

Hopkins statistic (Lawson and Jurs 1990) The null and alternative hypotheses of the Hopkins statistic (Lawson and Jurs 1990) are defined as follows:

Null hypothesis: the dataset under consideration is uniformly distributed (i.e. no significant clusters)
Alternative hypothesis: the dataset is not evenly distributed (i.e. contains significant clusters)

The Hopkins statistic test can be performed iteratively using 0.5 as the threshold to reject the alternative hypothesis. This means that if H <0.5 it is unlikely that D has statistically significant clusters.

In other words, if the value of the Hopkins statistic is close to 1, then the null hypothesis is rejected and the dataset is largely clusterable data.

According to the Hopkins statistics, which has a value of 0.68, it was found that the prepared data set is largely clustered. The above conclusion also confirms the ordered dissimilarity plot. On the basis of the graph, one should also notice different-colored fields, this is a reason for the possibility of finding clusters in the considered data set.

In order to apply clustering, it is necessary to determine the optimal number of clusters. The silhouette statistic will be applied to three planar clustering algorithms: k-means, PAM and CLARA, hierarchical clustering, and fuzzy clustering.

library(gridExtra)
a <- fviz_nbclust(data, FUNcluster = kmeans, method = "silhouette") + theme_classic() 
b <- fviz_nbclust(data, FUNcluster = cluster::pam, method = "silhouette") + theme_classic() 
c <- fviz_nbclust(data, FUNcluster = cluster::clara, method = "silhouette") + theme_classic() 
d <- fviz_nbclust(data, FUNcluster = hcut, method = "silhouette") + theme_classic() 
e <- fviz_nbclust(data, FUNcluster = cluster::fanny, method = "silhouette") + theme_classic()

## Warning in FUNcluster(x, i, ...): the memberships are all very close to 1/k.
## Maybe decrease 'memb.exp' ?

## Warning in FUNcluster(x, i, ...): the memberships are all very close to 1/k.
## Maybe decrease 'memb.exp' ?

## Warning in FUNcluster(x, i, ...): the memberships are all very close to 1/k.
## Maybe decrease 'memb.exp' ?

## Warning in FUNcluster(x, i, ...): the memberships are all very close to 1/k.
## Maybe decrease 'memb.exp' ?

## Warning in FUNcluster(x, i, ...): the memberships are all very close to 1/k.
## Maybe decrease 'memb.exp' ?

## Warning in FUNcluster(x, i, ...): the memberships are all very close to 1/k.
## Maybe decrease 'memb.exp' ?

grid.arrange(a, b, c, d,e, ncol=3)

According to Silhouette statistic, the optimal number of clusters is 2 for flat, for hierarchical algorithms is three and 8 for fuzzy.

4.2 Flat clustering

When using partition clustering, the most popular algorithms are k-average, PAM, and CLARA. In this article, the following algorithms are mastered: k-means and PAM. This is because CLARA is an implementation of the PAM algorithm for big data. The considered data set is classified as a smaller data set.

4.2.1 K-Means Method

The k-means method is one of the simplest unsupervised learning algorithms that solves the problem of grouping objects. Grouping consists of initial dividing the analyzed community into a predetermined number of classes. Once the target number of clusters has been determined, the main consideration is to initially define the centers of gravity for each cluster. So a good choice is to place them as far apart as possible from each other. The basic algorithm can be represented as follows: • Arranging the division of objects into groups. • Random selection of cluster centers. • Assignment of points to the nearest centroids. • Calculation of new cluster measures. • Repeating the algorithm until the convergence criterion is reached. Most often, this is the step in which the class allocation of points has not changed. The algorithm’s goal is to minimize the following objective function:

where is the chosen distance measure between observations x and the mean of the cluster it is assigned to.

I will use the k-means method to determine the number of groups in the set analyzed. In contrast to hierarchical methods, the k-means method is less sensitive to atypical and non-differentiating observations, and is better at dealing with large data sets.

Pearson correlation

library(factoextra)
cl_kmeans <- eclust(data, k=2, FUNcluster="kmeans", hc_metric="pearson", graph=FALSE)
a <- fviz_silhouette(cl_kmeans)

##   cluster size ave.sil.width
## 1       1   23          0.30
## 2       2   22          0.26

groups = cl_kmeans$cluster
data1$groups = cl_kmeans$cluster
barplot(table(groups), col="blue")

b <- fviz_cluster(cl_kmeans, data = data, elipse.type = "convex") + theme_minimal()
grid.arrange(a, b, ncol=2)

Euclidean distance

cl_kmeans1 <- eclust(data, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
g <- fviz_silhouette(cl_kmeans1)

##   cluster size ave.sil.width
## 1       1   23          0.30
## 2       2   22          0.26

groups1 = cl_kmeans1$cluster
barplot(table(groups1), col="blue")

h <- fviz_cluster(cl_kmeans1, data = data, elipse.type = "convex") + theme_minimal()
grid.arrange(g, h, ncol=2)

Mahalanobis distance

# k-means with Mahalanobis distance

S_x <- cov(data)
iS <- solve(S_x) 
e <- eigen(iS)
V <- e$vectors
B <- V %*% diag(sqrt(e$values)) %*% t(V)
Xtil <- scale(data,scale = FALSE)
XS <- Xtil %*% B

# kmeans with 2 clusters
fit = kmeans(data, centers=2, nstart=100)
groups = fit$cluster
barplot(table(groups), col="blue")

In conclusion, on the basis of the obtained results, it was found that there were no significant differences between the Pearson correlation and the Euclidean distance. Silhouette statistics for both approaches are equal to 0.28.

4.2.2 PAM

Pearson correlation

cl_pam <- eclust(data, k=2, FUNcluster="pam", hc_metric="pearson", graph=FALSE)
c <- fviz_silhouette(cl_pam)

##   cluster size ave.sil.width
## 1       1   21          0.29
## 2       2   24          0.26

groups_pam = cl_pam$cluster
barplot(table(groups_pam), col="blue")

d <- fviz_cluster(cl_pam, data = data, elipse.type = "convex") + theme_minimal()
grid.arrange(c, d, ncol=2)

Euclidean distance

cl_pam1 <- eclust(data, k=2, FUNcluster="pam", hc_metric="euclidean", graph=FALSE)
i <- fviz_silhouette(cl_pam1)

##   cluster size ave.sil.width
## 1       1   21          0.29
## 2       2   24          0.26

groups_pam1 = cl_pam1$cluster
barplot(table(groups_pam1), col="blue")

j <- fviz_cluster(cl_pam1, data = data, elipse.type = "convex") + theme_minimal()
grid.arrange(i, j, ncol=2)

Mahalanobis distance

library(ClusterR)

## Loading required package: gtools

cl_pam2 <- Cluster_Medoids(data, 2, distance_metric = "mahalanobis", verbose = FALSE, seed = 1234)
Silhouette_Dissimilarity_Plot(cl_pam2, silhouette = TRUE)

## [1] TRUE

On the basis of the obtained results related to the application of the PAM implementation, it was obtained that there was also no difference between the Pearson correlation and the Euclidean distance, similarly to the k-means. The silhouette statistic for both is 0.28. It should be noted that the Silhouette statistic for is much lower (0.038). In conclusion, it should be noted that there are no clear differences in the application of a distance other than the Euclidean distance.

On the basis of two approaches of k-means and PAM, it was obtained that PAM with Euclidean distance and k-means with Euclidean distance are similary in terms of efficiency.

4.2.3 Validation

As part of the validation of the quoted clusters, the stability of selected clusters should also be analyzed. Cluster stability should be verified for the following cases: * creating sub-samples, * adding noise to data.

The bootstrap method (clusterboot ()) will be used for this.

```r
library(gclus)
```

```
## Loading required package: cluster
```

```r
library(ggplot2)
library(dplyr)
```

```
## 
## Attaching package: 'dplyr'
```

```
## The following object is masked from 'package:gridExtra':
## 
##     combine
```

```
## The following objects are masked from 'package:stats':
## 
##     filter, lag
```

```
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
```

```r
library(fpc)  

set.seed(20)
cboot.hclust <- clusterboot(data, B=100, bootmethod="boot",
                    clustermethod=pamkCBI,
                    krange=2, seed=20)
```

```
## boot 1 
## boot 2 
## boot 3 
## boot 4 
## boot 5 
## boot 6 
## boot 7 
## boot 8 
## boot 9 
## boot 10 
## boot 11 
## boot 12 
## boot 13 
## boot 14 
## boot 15 
## boot 16 
## boot 17 
## boot 18 
## boot 19 
## boot 20 
## boot 21 
## boot 22 
## boot 23 
## boot 24 
## boot 25 
## boot 26 
## boot 27 
## boot 28 
## boot 29 
## boot 30 
## boot 31 
## boot 32 
## boot 33 
## boot 34 
## boot 35 
## boot 36 
## boot 37 
## boot 38 
## boot 39 
## boot 40 
## boot 41 
## boot 42 
## boot 43 
## boot 44 
## boot 45 
## boot 46 
## boot 47 
## boot 48 
## boot 49 
## boot 50 
## boot 51 
## boot 52 
## boot 53 
## boot 54 
## boot 55 
## boot 56 
## boot 57 
## boot 58 
## boot 59 
## boot 60 
## boot 61 
## boot 62 
## boot 63 
## boot 64 
## boot 65 
## boot 66 
## boot 67 
## boot 68 
## boot 69 
## boot 70 
## boot 71 
## boot 72 
## boot 73 
## boot 74 
## boot 75 
## boot 76 
## boot 77 
## boot 78 
## boot 79 
## boot 80 
## boot 81 
## boot 82 
## boot 83 
## boot 84 
## boot 85 
## boot 86 
## boot 87 
## boot 88 
## boot 89 
## boot 90 
## boot 91 
## boot 92 
## boot 93 
## boot 94 
## boot 95 
## boot 96 
## boot 97 
## boot 98 
## boot 99 
## boot 100
```

cboot.hclust$bootmean

## [1] 0.8554125 0.8693315

On the basis of the bootstrap procedure performed to assess the stability of the defined clusters, it should be noted that all selected clusters are stable, because some averages from a given cluster are close to 1 (a perfectly stable cluster).

cboot.hclust$bootbrd

## [1] 1 3

In order to analyze the obtained results in detail, the bootstrap procedure can be used to analyze the “bootbrd” measure, which determines how many times the cluster has been dissolved. Based on 100 iterations, how many clusters have been resolved.

cboot.hclust$bootrecover

## [1] 76 76

The “bootrecover” measure displays the number of clusters that have been successfully recovered.

In summary, the clusters obtained by the PAM approach are stable.

4.3 Hierarchical methods

There are two types of hierarchical clustering: agglomerative (bottom-up approach: HAC or AGNES) and divisive (top-down approach: DIANA).

The Ward method is one of the agglomeration grouping methods. It is distinguished by taking into account the analysis of variance when estimating the distance between clusters. It consists in minimizing the sum of squared deviations of any two clusters that may be formed at each stage. It is effective, but often small-sized clusters are created with it. It allows you to control the number of groups and shows the most natural clusters of elements. This method requires in turn: determining the distance matrix containing the distances of each pair of objects, then finding such pairs for which the distance is the smallest and combining them into one cluster, and then determining new distances between the newly created cluster and others. These activities are repeated until all units are combined into one cluster. Then a tree diagram is created - the dendrogram. The horizontal axis shows the distances between clusters. On its basis, it is possible to read the order of combining objects as well as determine any number of groups and determine the number and composition of individual groups. Dendrogram obtained for the countries of Europe and Central Asia based on the previously described variables, it is as follows.

4.3.1 Agglomerative hierarchical clustering

Single linkage

hc <- eclust(data, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "single")
plot(hc, cex=0.6, hang=-1, main = "Dendrogram of HAC - single")
rect.hclust(hc, k=3, border='red')

Complete linkage

hc1 <- eclust(data, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "complete")
plot(hc1, cex=0.6, hang=-1, main = "Dendrogram of HAC - complete")
rect.hclust(hc1, k=3, border='red')

Average linkage

hc2 <- eclust(data, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "average")
plot(hc2, cex=0.6, hang=-1, main = "Dendrogram of HAC - average")
rect.hclust(hc2, k=3, border='red')

Ward’s method

hc3 <- eclust(data, k=3, FUNcluster="hclust", hc_metric="euclidean", hc_method = "ward.D2")
plot(hc3, cex=0.6, hang=-1, main = "Dendrogram of HAC - Ward")
rect.hclust(hc3, k=3, border='red')

clusterCut3 <- cutree(hc3, 3)
table(data1$countryname, clusterCut3)

##                     clusterCut3
##                      1 2 3
##   Albania            1 0 0
##   Armenia            1 0 0
##   Austria            0 1 0
##   Azerbaijan         1 0 0
##   Belarus            1 0 0
##   Belgium            0 1 0
##   Bulgaria           0 1 0
##   Croatia            0 1 0
##   Czech Republic     0 1 0
##   Denmark            0 1 0
##   Estonia            0 1 0
##   Finland            0 1 0
##   France             0 1 0
##   Georgia            0 1 0
##   Germany            0 1 0
##   Greece             0 1 0
##   Hungary            0 1 0
##   Iceland            0 1 0
##   Ireland            0 1 0
##   Italy              0 1 0
##   Kazakhstan         1 0 0
##   Kyrgyz Republic    1 0 0
##   Latvia             0 1 0
##   Lithuania          0 1 0
##   Luxembourg         0 0 1
##   Macedonia          0 1 0
##   Moldova            0 1 0
##   Netherlands        0 1 0
##   Norway             0 1 0
##   Poland             0 1 0
##   Portugal           0 1 0
##   Romania            1 0 0
##   Russian Federation 1 0 0
##   Serbia             1 0 0
##   Slovak Republic    0 1 0
##   Slovenia           0 1 0
##   Spain              0 1 0
##   Sweden             0 1 0
##   Switzerland        0 1 0
##   Tajikistan         1 0 0
##   Turkey             1 0 0
##   Turkmenistan       1 0 0
##   Ukraine            1 0 0
##   United Kingdom     0 1 0
##   Uzbekistan         1 0 0

In sum, on the basis of several approaches to clustering, it was found that the Ward method is the most appropriate, as it was assumed earlier. Continuing the results obtained with the Ward’s method, the “fpc” library was used.

library(fpc)
dd <- dist(data, method ="euclidean")
hc_stats <- cluster.stats(dd, hc3$cluster)
# number of observations in each cluster
hc_stats$cluster.size

## [1] 14 30  1

On the basis of the Ward’s method used, 3 clusters were obtained, which consist of the following observations: 14, 30, 1.

# within cluster sum of squares
hc_stats$within.cluster.ss

## [1] 358.4149

The within sum of squares the cluster is equal to 358.4149.

hc_stats$avg.silwidth # average silhouette width

## [1] 0.2693754

The average silhouette width is equal to 0.2694.

hc_stats$clus.avg.silwidths # average silhouette widths for each cluster

##         1         2         3 
## 0.3245755 0.2525945 0.0000000

The average silhouette widths for each cluster are 0.32, 0.25, and 0.00.

For verification purposes, it is also worth using the statistics on the effectiveness of clustering, i.e. the Dunn Index: as the Dunn index increases, clustering is better.

hc_stats$min.separation

## [1] 2.277796

hc_stats$max.diameter

## [1] 8.427987

dunn <- hc_stats$min.separation / hc_stats$max.diameter
cat("Dunn Index is equal to", round(dunn, 2))

## Dunn Index is equal to 0.27

It is also worth analyzing the agglomeration coefficient. The above factor is responsible for the structure of clusters: the smaller the value, the tightly concentrated.

library(cluster)
cat("Agglomerative coefficient is equal to", round(coef.hclust(hc3), 2))

## Agglomerative coefficient is equal to 0.86

4.3.2 Divisive hierarchical clustering

hc4 <- eclust(data, k=3, FUNcluster="diana")
pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of DIANA")
rect.hclust(hc4, k=3, border='red')

hc_stats1 <- cluster.stats(dd, hc4$cluster)
hc_stats1$cluster.size # number of observations per cluster

## [1] 20 24  1

On the basis of the Diana’s method used, 3 clusters were obtained, which consist of the following observations: 20, 24, 1.

hc_stats1$within.cluster.ss # within cluster sum of squares

## [1] 336.167

The within sum of squares the cluster is equal to 336.167.

hc_stats1$avg.silwidth # average silhouette width

## [1] 0.2805902

The average silhouette width is equal to 0.2109171.

hc_stats1$clus.avg.silwidths # average silhouette widths for each cluster

##         1         2         3 
## 0.2580614 0.3110555 0.0000000

The average silhouette widths for each cluster are 0.25, 0.31, and 0.00.

dunn <- hc_stats1$min.separation / hc_stats1$max.diameter
cat("Dunn Index is equal to", round(dunn, 2))

## Dunn Index is equal to 0.25

cat("Divisive coefficient is equal to", round(coef(hc4), 2))

## Divisive coefficient is equal to 0.79

Finally, it is necessary to use the bootstrap method to verify the stability of the results. Ward’s method was chosen as the most appropriate for the bootstrap in terms of the results obtained.

library(gclus)
library(ggplot2)
library(dplyr)
library(fpc)   
 #set seed - for random                              
set.seed(20)
cboot.hclust1 <- clusterboot(data, B=100, 
                             method="ward.D", 
                        clustermethod=hclustCBI,
                        k=3, seed=20)

## boot 1 
## boot 2 
## boot 3 
## boot 4 
## boot 5 
## boot 6 
## boot 7 
## boot 8 
## boot 9 
## boot 10 
## boot 11 
## boot 12 
## boot 13 
## boot 14 
## boot 15 
## boot 16 
## boot 17 
## boot 18 
## boot 19 
## boot 20 
## boot 21 
## boot 22 
## boot 23 
## boot 24 
## boot 25 
## boot 26 
## boot 27 
## boot 28 
## boot 29 
## boot 30 
## boot 31 
## boot 32 
## boot 33 
## boot 34 
## boot 35 
## boot 36 
## boot 37 
## boot 38 
## boot 39 
## boot 40 
## boot 41 
## boot 42 
## boot 43 
## boot 44 
## boot 45 
## boot 46 
## boot 47 
## boot 48 
## boot 49 
## boot 50 
## boot 51 
## boot 52 
## boot 53 
## boot 54 
## boot 55 
## boot 56 
## boot 57 
## boot 58 
## boot 59 
## boot 60 
## boot 61 
## boot 62 
## boot 63 
## boot 64 
## boot 65 
## boot 66 
## boot 67 
## boot 68 
## boot 69 
## boot 70 
## boot 71 
## boot 72 
## boot 73 
## boot 74 
## boot 75 
## boot 76 
## boot 77 
## boot 78 
## boot 79 
## boot 80 
## boot 81 
## boot 82 
## boot 83 
## boot 84 
## boot 85 
## boot 86 
## boot 87 
## boot 88 
## boot 89 
## boot 90 
## boot 91 
## boot 92 
## boot 93 
## boot 94 
## boot 95 
## boot 96 
## boot 97 
## boot 98 
## boot 99 
## boot 100

```r
cboot.hclust1$bootmean
```

```
## [1] 0.7837763 0.8334346 0.6184061
```

cboot.hclust1$bootbrd

## [1] 12  9 49

cboot.hclust1$bootrecover

## [1] 55 71 37

The “bootrecover” measure displays the number of clusters that have been successfully recovered.

In summary, the clusters obtained by the Ward’s method are stable.

##Fuzzy clustering

Traditional grouping methods are performed in such a way that the selected groups are deterministic aggregates. The limitation of the above-considered approach leads to analytical uncertainty, therefore an alternative is to interpret the system elements as fuzzy objects within the framework of flexibly configurable ordering structures.

At the end of the 80s, the development of technical devices based on fuzzy controllers began, and fuzzy logic became an integral part of modern systems artificial intelligence.

As part of the research, the fanny() function from the cluster package will be used. The following parameters were selected in the fanny() function: * the number of clusters is 8 based on the previous charts, * the membership exponent has been set to two levels 1,2 and 1,5 and * a parameter was selected for the dissimilarity calculation metric Euclidean Distance.

library(cluster)
clust_fanny <- fanny(data, k=8,  memb.exp = 1.2, metric = "euclidean")
head(clust_fanny$membership, n=8)

##              [,1]         [,2]         [,3]         [,4]         [,5]
## [1,] 0.8949120014 3.198436e-03 0.0011626976 0.0096679700 0.0027688770
## [2,] 0.0727148856 8.876800e-01 0.0004492703 0.0271977038 0.0014542999
## [3,] 0.0003859201 9.835691e-05 0.9051752494 0.0003681028 0.0334263357
## [4,] 0.0018639570 9.936422e-01 0.0001181413 0.0029526055 0.0005592213
## [5,] 0.0105199212 1.811445e-02 0.0007287316 0.9574518315 0.0044233585
## [6,] 0.0112454081 4.420043e-03 0.1701117378 0.0139836240 0.5406777848
## [7,] 0.0017257818 3.444006e-04 0.0002866507 0.0034007953 0.0065712334
## [8,] 0.0037497135 3.414259e-04 0.0045562407 0.0025664333 0.0288145272
##             [,6]         [,7]         [,8]
## [1,] 0.083032430 0.0052487317 8.856031e-06
## [2,] 0.009763511 0.0007329706 7.314283e-06
## [3,] 0.003607452 0.0569299886 8.594770e-06
## [4,] 0.000729257 0.0001303311 4.326644e-06
## [5,] 0.008038556 0.0006961234 2.702882e-05
## [6,] 0.071908301 0.1826344090 5.018693e-03
## [7,] 0.987227885 0.0004416432 1.610220e-06
## [8,] 0.948944673 0.0110239340 3.051898e-06

To assess the fuzziness measure of the resulting classification, the Dunn partition coefficient is used:

clust_fanny$coeff

## dunn_coeff normalized 
##  0.8419518  0.8193734

Dunn partition coefficient which takes the minimum value when the partition is completely fuzzy, otherwise the distance from each object to the center of gravity of any cluster is equal. On the contrary, in the case of clear clustering, the Dunn coefficient takes the value 1. In the analysis Dunn coefficient is equal to 0.84. Its normalized version, varying from 0 to 1 and characterizing the degree of fuzziness is equal to 0.81.

a <- fviz_silhouette(clust_fanny)

##   cluster size ave.sil.width
## 1       1    5          0.19
## 2       2    3          0.14
## 3       3    9          0.21
## 4       4    6          0.23
## 5       5    7          0.12
## 6       6    8          0.19
## 7       7    6          0.30
## 8       8    1          0.00

It is useful to change the parameter memb.exp from 1.2 to 1.3.

library(cluster)
clust_fanny1 <- fanny(data, k=8,  memb.exp = 1.3, metric = "euclidean")
head(clust_fanny$membership, n=8)

##              [,1]         [,2]         [,3]         [,4]         [,5]
## [1,] 0.8949120014 3.198436e-03 0.0011626976 0.0096679700 0.0027688770
## [2,] 0.0727148856 8.876800e-01 0.0004492703 0.0271977038 0.0014542999
## [3,] 0.0003859201 9.835691e-05 0.9051752494 0.0003681028 0.0334263357
## [4,] 0.0018639570 9.936422e-01 0.0001181413 0.0029526055 0.0005592213
## [5,] 0.0105199212 1.811445e-02 0.0007287316 0.9574518315 0.0044233585
## [6,] 0.0112454081 4.420043e-03 0.1701117378 0.0139836240 0.5406777848
## [7,] 0.0017257818 3.444006e-04 0.0002866507 0.0034007953 0.0065712334
## [8,] 0.0037497135 3.414259e-04 0.0045562407 0.0025664333 0.0288145272
##             [,6]         [,7]         [,8]
## [1,] 0.083032430 0.0052487317 8.856031e-06
## [2,] 0.009763511 0.0007329706 7.314283e-06
## [3,] 0.003607452 0.0569299886 8.594770e-06
## [4,] 0.000729257 0.0001303311 4.326644e-06
## [5,] 0.008038556 0.0006961234 2.702882e-05
## [6,] 0.071908301 0.1826344090 5.018693e-03
## [7,] 0.987227885 0.0004416432 1.610220e-06
## [8,] 0.948944673 0.0110239340 3.051898e-06

clust_fanny1$coeff

## dunn_coeff normalized 
##  0.6148298  0.5598055

Based on Dunn’s coefficient the value of statistic deteriorated.

b <- fviz_silhouette(clust_fanny1)

##   cluster size ave.sil.width
## 1       1    6          0.15
## 2       2    9          0.21
## 3       3    2          0.23
## 4       4    6          0.23
## 5       5    7          0.12
## 6       6    8          0.19
## 7       7    6          0.30
## 8       8    1          0.00

So in the further part of this paper fanny method with memb.exp equal to 1.2 will be used.

grid.arrange(a, b, ncol=1)

To verify which countries are divided into clusters it is useful to analyze initial statistics.

table(data1$countryname, clust_fanny$clustering)

##                     
##                      1 2 3 4 5 6 7 8
##   Albania            1 0 0 0 0 0 0 0
##   Armenia            0 1 0 0 0 0 0 0
##   Austria            0 0 1 0 0 0 0 0
##   Azerbaijan         0 1 0 0 0 0 0 0
##   Belarus            0 0 0 1 0 0 0 0
##   Belgium            0 0 0 0 1 0 0 0
##   Bulgaria           0 0 0 0 0 1 0 0
##   Croatia            0 0 0 0 0 1 0 0
##   Czech Republic     0 0 0 0 1 0 0 0
##   Denmark            0 0 1 0 0 0 0 0
##   Estonia            0 0 0 0 1 0 0 0
##   Finland            0 0 1 0 0 0 0 0
##   France             0 0 0 0 0 0 1 0
##   Georgia            0 0 0 0 0 1 0 0
##   Germany            0 0 1 0 0 0 0 0
##   Greece             0 0 0 0 0 0 1 0
##   Hungary            0 0 0 0 1 0 0 0
##   Iceland            0 0 1 0 0 0 0 0
##   Ireland            0 0 0 0 1 0 0 0
##   Italy              0 0 0 0 0 0 1 0
##   Kazakhstan         0 0 0 1 0 0 0 0
##   Kyrgyz Republic    1 0 0 0 0 0 0 0
##   Latvia             0 0 0 0 0 1 0 0
##   Lithuania          0 0 0 0 0 1 0 0
##   Luxembourg         0 0 0 0 0 0 0 1
##   Macedonia          0 0 0 0 0 1 0 0
##   Moldova            0 0 0 0 0 1 0 0
##   Netherlands        0 0 1 0 0 0 0 0
##   Norway             0 0 1 0 0 0 0 0
##   Poland             0 0 0 0 0 1 0 0
##   Portugal           0 0 0 0 0 0 1 0
##   Romania            0 0 0 1 0 0 0 0
##   Russian Federation 0 0 0 1 0 0 0 0
##   Serbia             0 0 0 1 0 0 0 0
##   Slovak Republic    0 0 0 0 1 0 0 0
##   Slovenia           0 0 0 0 1 0 0 0
##   Spain              0 0 0 0 0 0 1 0
##   Sweden             0 0 1 0 0 0 0 0
##   Switzerland        0 0 1 0 0 0 0 0
##   Tajikistan         1 0 0 0 0 0 0 0
##   Turkey             1 0 0 0 0 0 0 0
##   Turkmenistan       0 1 0 0 0 0 0 0
##   Ukraine            0 0 0 1 0 0 0 0
##   United Kingdom     0 0 0 0 0 0 1 0
##   Uzbekistan         1 0 0 0 0 0 0 0

An interesting division was obtained using the fanny() algorithm. The resulting division of countries, however, requires more in-depth analysis. Due to the fact that the phases of the methods remain in development, this part of the study is left as the part requiring further investigation.

5 Conclusions

In the conducted analysis, the countries of Europe and Central Asia were divided according to selected macroeconomic indicators. However, the expected split has not been achieved. This may be due to the fact that a large number of variables were taken into account in the study, not all of which differentiated sufficiently the countries covered by the analysis. Some macroeconomic factors significantly differed in the countries selected for analysis, therefore single-element clusters were obtained. Despite this, it was possible to distinguish between highly developed and low-developed countries, while countries with average values of the variables included in the study were less separated.

It’s worth pointing out the clusters obtained by the PAM approach are stable. Nevertheless, in practice it must be admitted and the choice of the appropriate method depends on the achievement of the business goals.

6 Bibliography

Aumayr C. M. (2006) European Region Types in EU-25, The European Journal of Comparative Economics Vol. 4, n. 2, pp. 109-147.
Usai S. , Paci R. , Moreno R. (2004), Innovation clusters in the European regions, ERSA conference papers, number ersa04p587, 2004.
https://www.rdocumentation.org/packages/fpc/versions/2.1-11.1/topics/clusterboot

Cluster analysis - The division of European and Central Asian countries according to selected macroeconomic factors

Marian Nehrebecki

12/26/2020