1) `Project description`:

The goal of this paper is to analyze the evidence of clustering based on the selected data. To this end, I chose several data from Eurostat-database, in order to find out if there are clusters in term of trade in the European Union (28 countries).

2) `Dataset description`

Description of the diferent features:

GDP Gross domestic product at market prices (2018, million euro)
I_TR_EX Intra-EU28 trade, exports (2018, millions of ECU/EURO)
I_TR_IM Intra-EU28 trade, imports (2018, millions of ECU/EURO)
E_TR_EX Extra-EU28 trade, exports (2018, millions of ECU/EURO)
E_TR_IM Extra-EU28 trade, imports (2018, millions of ECU/EURO)
T_TR_I Total intra-EU28 trade (2018, millions of ECU/EURO)
T_TR_E Total extra-EU28 trade (2018, millions of ECU/EURO)
T_TR_IM Total imports-EU28 trade (2018, millions of ECU/EURO)
T_TR_EX Total exports-EU28 trade (2018, millions of ECU/EURO)
T_GDP_R Total trade to GDP ratio (%)
A_T_G Air transport of goods (2018, tonnes)
T_TR_I = I_TR_EX + I_TR_IM
T_TR_E = E_TR_EX + E_TR_IM
T_TR_IM = I_TR_IM + E_TR_IM
T_TR_EX = I_TR_EX + E_TR_EX
T_GDP_R = ((T_TR_IM + T_TR_EX) / GDP) * 10

Data bibligraphy:

GDP Eurostat database (tipsau10) - https://ec.europa.eu/eurostat/web/products-datasets/-/tipsau10

I_TR_EX Eurostat database (tet00047) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00047

I_TR_IM Eurostat database (tet00047) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00047

E_TR_EX Eurostat database (tet00055) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00055

E_TR_IM Eurostat database (tet00055) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00055

A_T_G Eurostat database (ttr00011) – https://ec.europa.eu/eurostat/web/products-datasets/-/ttr00011

3) `Manipulation of the data`:

First we will load all the necesary packages and the data:

library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(ClusterR)
library(NbClust)
library(readxl)
library(reshape)
library(psych)
library(dplyr)

## # A tibble: 28 x 6
##    C_EU          GDP I_TR_EX I_TR_IM E_TR_EX E_TR_IM
##    <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Austria   385712. 111673. 127260.  44756.  36748.
##  2 Belgium   459820. 287689. 245472. 107201  136329.
##  3 Bulgaria   56087.  19275.  20403.   8821.  11702.
##  4 Croatia    51579.  10001.  18557.   4749.   5330.
##  5 Cyprus     21138.   1250.   5277.   3001.   3890.
##  6 Czechia   207772. 144491. 119732.  26769.  36726.
##  7 Denmark   301341.  56594.  60784.  36013   26001 
##  8 Estonia    26036.   9814.  12435.   4611.   3794.
##  9 Finland   234453   37884.  46716.  26352.  19861 
## 10 France   2353090  290669. 392438. 201915. 175901.
## # … with 18 more rows

Finally, the dataset “trade” contains 28 rows, and 12 columns

4) `Pre-diagnosis`:

Firstly, I created two groups of data, but when I applied Hopkins the results were greater than 15%, this means that the groups selected were not highly clustered.

Consequently, I decided to create three additional groups, and I selected the one with the smallest Hopkins value, hence I selected “X1”, with the Hopkins value of: 0.1134274 (11,34%).

X1 - First group: GDP, T_GDP_R, A_T_G

## # A tibble: 28 x 3
##         GDP T_GDP_R   A_T_G
##       <dbl>   <dbl>   <dbl>
##  1  385712.    83.1  237701
##  2  459820.   169.  1416428
##  3   56087.   107.    29867
##  4   51579.    74.9   11934
##  5   21138.    63.5   32186
##  6  207772.   158.    90526
##  7  301341.    59.5  242068
##  8   26036.   118.    11475
##  9  234453     55.8  196810
## 10 2353090     45.1 2407878
## # … with 18 more rows

[1] 0.8865726

The hopkins value obtained is 11.34%

On the above graph it can be observed that there is one big cluster, and another additional cluster, also one outliner (11) – Germany.

X2 - Second group: GDP, I_TR_EX, E_TR_EX

## # A tibble: 28 x 3
##         GDP I_TR_EX E_TR_EX
##       <dbl>   <dbl>   <dbl>
##  1  385712. 111673.  44756.
##  2  459820. 287689. 107201 
##  3   56087.  19275.   8821.
##  4   51579.  10001.   4749.
##  5   21138.   1250.   3001.
##  6  207772. 144491.  26769.
##  7  301341.  56594.  36013 
##  8   26036.   9814.   4611.
##  9  234453   37884.  26352.
## 10 2353090  290669. 201915.
## # … with 18 more rows

[1] 0.8222338

The hopkins value obtained is 17.77%

X3 - Third group: GDP, I_TR_IM,E_TR_IM

## # A tibble: 28 x 3
##         GDP I_TR_IM E_TR_IM
##       <dbl>   <dbl>   <dbl>
##  1  385712. 127260.  36748.
##  2  459820. 245472. 136329.
##  3   56087.  20403.  11702.
##  4   51579.  18557.   5330.
##  5   21138.   5277.   3890.
##  6  207772. 119732.  36726.
##  7  301341.  60784.  26001 
##  8   26036.  12435.   3794.
##  9  234453   46716.  19861 
## 10 2353090  392438. 175901.
## # … with 18 more rows

## [1] 0.8137182

The hopkins value obtained is 18.62%

X4 - Fourth group: T_TR_I, T_TR_E, A_T_G

## # A tibble: 28 x 3
##     T_TR_I  T_TR_E   A_T_G
##      <dbl>   <dbl>   <dbl>
##  1 238933   81503.  237701
##  2 533162. 243530. 1416428
##  3  39678.  20522.   29867
##  4  28558.  10079.   11934
##  5   6527.   6891.   32186
##  6 264223.  63495    90526
##  7 117378.  62014   242068
##  8  22248.   8404.   11475
##  9  84600.  46213.  196810
## 10 683107  377816  2407878
## # … with 18 more rows

[1] 0.8625978

The hopkins value obtained is 13.74%

X5 - Fifth group: I_TR_EX, E_TR_EX, T_TR_E

## # A tibble: 28 x 3
##    T_GDP_R  T_TR_E   A_T_G
##      <dbl>   <dbl>   <dbl>
##  1    83.1  81503.  237701
##  2   169.  243530. 1416428
##  3   107.   20522.   29867
##  4    74.9  10079.   11934
##  5    63.5   6891.   32186
##  6   158.   63495    90526
##  7    59.5  62014   242068
##  8   118.    8404.   11475
##  9    55.8  46213.  196810
## 10    45.1 377816  2407878
## # … with 18 more rows

[1] 0.8551166

The hopkins value obtained is 14.48%

To go further, I will select X1, as it has the lowest Hopkings value

## $hopkins_stat
## [1] 0.8494199
## 
## $plot

The hopkins value obtained is 84.94%

$hopkins_stat [1] 0.9536594

$plot

The hopkins value obtained is 95.36%

$hopkins_stat [1] 0.9905293

$plot

The hopkins value obtained is 99.05%

$hopkins_stat [1] 0.8645139

$plot

The hopkins value obtained is 86.45%

5) Identification of the `numBER of clusters`:

5.1) `d index`:

*** : The D index is a graphical method of determining the number of clusters. In the plot of D index, we seek a significant knee (the significant peak in Dindex second differences plot) that corresponds to a significant increase of the value of the measure.

$All.index 2 3 4 5 6 7 8 9 538448.59 344488.62 239732.59 184447.65 159562.98 117449.20 95298.51 76051.16 10 45686.66

As per the above graphic, it seems that the optimum is 4 clusters, the same according to the index.

5.2) `Hubert`:

*** : The Hubert index is a graphical method of determining the number of clusters. In the plot of Hubert index, we seek a significant knee that corresponds to a significant increase of the value of the measure i.e the significant peak in Hubert index second differences plot.

$All.index 2 3 4 5 6 7 8 9 10 0 0 0 0 0 0 0 0 0

As per the above graphic, also it seems that the optimum option is to choose 4 clusters.

5.3) `K-means` (silhouette and wss):

Firsty, silhoutte:

Finally, wss:

As a result of this operation it appears that the optimal number of clusters should be 2, also we can consider 3, as there is not a lot of difference between both values

5.4) `PAM` (silhouette and wss):

Firsty, silhoutte:

Finally, wss:

As per the above results, it looks that the optimal number of clusters should be 2, also we can consider 3 and 4, as there is not a lot of difference between these values.

Considering all the results above, I could say that the number of optimal clusters should be 3, because the result of two methods considered the optimal as 4 clusters, and another two methods considered the optimal as 2, hence I believe it would be better to take the average.

6) `Clustering`:

6.1) `K-means`:

cluster size ave.sil.width 1 1 3 0.32 2 2 21 0.82 3 3 4 0.36

The average of silhouette width for K-means is 0.7. Analyzing the position of the data points in the graph, it seems that they matched with their own cluster.

6.2) `PAM`:

cluster size ave.sil.width 1 1 20 0.86 2 2 5 0.21 3 3 3 0.37

The average of silhouette width for PAM is 0.69.

It appears that K-mean fits the data better, hence I will continue using K-means method.

7) `Post-diagnosis`:

    1         2         3

0.5932179 0.2197336 0.5746008

As per the above values, I can conclude that the data is not clustered properly, as the first and the third value are similar.

For the reason above, I will attempt to perform the same analysis considering two clusters with K-means method:

    1         2

0.1823198 0.5727513

As per the above values for two clusters, I can conclude that the data is clustered properly, the values of the clusters are not close.

Therefore, I will proceed once again to perform the analysis of the point 5.1), in this case with two clusters intead of three:

cluster size ave.sil.width 1 1 3 0.55 2 2 25 0.81

The average silhouette is 0.78. Greater than with three clusters.

8) `Results`:

There is a lot of distance in the second cluster, this means that the countries differ the most in the second group.

1. Orange – Big economies:

The first group of countries has high GDP, with a low trade of GDP ratio, but big air transport of goods.

1. Blue – Small economies:

The second group of countries has small GDP, in average it has greater trade GDP than the first group, but lowest air transports of goods than the first group.

9) `Conclusions`:

The method K-Means shows that the optimum division of the data would be in two clusters, considering the following variables: GDP, T_GDP_R and A_T_G.

The two groups have differing characteristics, divided between big countries with big GDP, and small countries with small GDP, with more distance in the first group of big economies. Regarding T_GDP_R the second group has a huge distance in comparison with the first. Finally, for the variable A_T_G, the second group has not a lot of distance, the data are concentrated around the same point.

Clustering applied to the EU trade data, 2018

Clustering applied to the EU trade data, 2018

1) `Project description`:

2) `Dataset description`

3) `Manipulation of the data`:

4) `Pre-diagnosis`:

5) Identification of the `numBER of clusters`:

5.1) `d index`:

5.2) `Hubert`:

5.3) `K-means` (silhouette and wss):

5.4) `PAM` (silhouette and wss):

6) `Clustering`:

6.1) `K-means`:

6.2) `PAM`:

7) `Post-diagnosis`:

8) `Results`:

9) `Conclusions`:

Clustering applied to the EU trade data, 2018

Clustering applied to the EU trade data, 2018

1) Project description:

2) Dataset description

3) Manipulation of the data:

4) Pre-diagnosis:

5) Identification of the numBER of clusters:

5.1) d index:

5.2) Hubert:

5.3) K-means (silhouette and wss):

5.4) PAM (silhouette and wss):

6) Clustering:

6.1) K-means:

6.2) PAM:

7) Post-diagnosis:

8) Results:

9) Conclusions:

1) `Project description`:

2) `Dataset description`

3) `Manipulation of the data`:

4) `Pre-diagnosis`:

5) Identification of the `numBER of clusters`:

5.1) `d index`:

5.2) `Hubert`:

5.3) `K-means` (silhouette and wss):

5.4) `PAM` (silhouette and wss):

6) `Clustering`:

6.1) `K-means`:

6.2) `PAM`:

7) `Post-diagnosis`:

8) `Results`:

9) `Conclusions`: