1) Project description:

The goal of this paper is to analyze the evidence of clustering based on the selected data. To this end, I chose several data from Eurostat-database, in order to find out if there are clusters in term of trade in the European Union (28 countries).

2) Dataset description

Description of the diferent features:

  1. GDP Gross domestic product at market prices (2018, million euro)
  2. I_TR_EX Intra-EU28 trade, exports (2018, millions of ECU/EURO)
  3. I_TR_IM Intra-EU28 trade, imports (2018, millions of ECU/EURO)
  4. E_TR_EX Extra-EU28 trade, exports (2018, millions of ECU/EURO)
  5. E_TR_IM Extra-EU28 trade, imports (2018, millions of ECU/EURO)
  6. T_TR_I Total intra-EU28 trade (2018, millions of ECU/EURO)
  7. T_TR_E Total extra-EU28 trade (2018, millions of ECU/EURO)
  8. T_TR_IM Total imports-EU28 trade (2018, millions of ECU/EURO)
  9. T_TR_EX Total exports-EU28 trade (2018, millions of ECU/EURO)
  10. T_GDP_R Total trade to GDP ratio (%)
  11. A_T_G Air transport of goods (2018, tonnes)
  12. T_TR_I = I_TR_EX + I_TR_IM
  13. T_TR_E = E_TR_EX + E_TR_IM
  14. T_TR_IM = I_TR_IM + E_TR_IM
  15. T_TR_EX = I_TR_EX + E_TR_EX
  16. T_GDP_R = ((T_TR_IM + T_TR_EX) / GDP) * 10

Data bibligraphy:

GDP Eurostat database (tipsau10) - https://ec.europa.eu/eurostat/web/products-datasets/-/tipsau10

I_TR_EX Eurostat database (tet00047) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00047

I_TR_IM Eurostat database (tet00047) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00047

E_TR_EX Eurostat database (tet00055) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00055

E_TR_IM Eurostat database (tet00055) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00055

A_T_G Eurostat database (ttr00011) – https://ec.europa.eu/eurostat/web/products-datasets/-/ttr00011

3) Manipulation of the data:

  • First we will load all the necesary packages and the data:
## # A tibble: 28 x 6
##    C_EU          GDP I_TR_EX I_TR_IM E_TR_EX E_TR_IM
##    <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Austria   385712. 111673. 127260.  44756.  36748.
##  2 Belgium   459820. 287689. 245472. 107201  136329.
##  3 Bulgaria   56087.  19275.  20403.   8821.  11702.
##  4 Croatia    51579.  10001.  18557.   4749.   5330.
##  5 Cyprus     21138.   1250.   5277.   3001.   3890.
##  6 Czechia   207772. 144491. 119732.  26769.  36726.
##  7 Denmark   301341.  56594.  60784.  36013   26001 
##  8 Estonia    26036.   9814.  12435.   4611.   3794.
##  9 Finland   234453   37884.  46716.  26352.  19861 
## 10 France   2353090  290669. 392438. 201915. 175901.
## # … with 18 more rows

Finally, the dataset “trade” contains 28 rows, and 12 columns

4) Pre-diagnosis:

Firstly, I created two groups of data, but when I applied Hopkins the results were greater than 15%, this means that the groups selected were not highly clustered.

Consequently, I decided to create three additional groups, and I selected the one with the smallest Hopkins value, hence I selected “X1”, with the Hopkins value of: 0.1134274 (11,34%).

  • X1 - First group: GDP, T_GDP_R, A_T_G
## # A tibble: 28 x 3
##         GDP T_GDP_R   A_T_G
##       <dbl>   <dbl>   <dbl>
##  1  385712.    83.1  237701
##  2  459820.   169.  1416428
##  3   56087.   107.    29867
##  4   51579.    74.9   11934
##  5   21138.    63.5   32186
##  6  207772.   158.    90526
##  7  301341.    59.5  242068
##  8   26036.   118.    11475
##  9  234453     55.8  196810
## 10 2353090     45.1 2407878
## # … with 18 more rows

[1] 0.8865726

The hopkins value obtained is 11.34%

On the above graph it can be observed that there is one big cluster, and another additional cluster, also one outliner (11) – Germany.

  • X2 - Second group: GDP, I_TR_EX, E_TR_EX
## # A tibble: 28 x 3
##         GDP I_TR_EX E_TR_EX
##       <dbl>   <dbl>   <dbl>
##  1  385712. 111673.  44756.
##  2  459820. 287689. 107201 
##  3   56087.  19275.   8821.
##  4   51579.  10001.   4749.
##  5   21138.   1250.   3001.
##  6  207772. 144491.  26769.
##  7  301341.  56594.  36013 
##  8   26036.   9814.   4611.
##  9  234453   37884.  26352.
## 10 2353090  290669. 201915.
## # … with 18 more rows

[1] 0.8222338

The hopkins value obtained is 17.77%

  • X3 - Third group: GDP, I_TR_IM,E_TR_IM
## # A tibble: 28 x 3
##         GDP I_TR_IM E_TR_IM
##       <dbl>   <dbl>   <dbl>
##  1  385712. 127260.  36748.
##  2  459820. 245472. 136329.
##  3   56087.  20403.  11702.
##  4   51579.  18557.   5330.
##  5   21138.   5277.   3890.
##  6  207772. 119732.  36726.
##  7  301341.  60784.  26001 
##  8   26036.  12435.   3794.
##  9  234453   46716.  19861 
## 10 2353090  392438. 175901.
## # … with 18 more rows

## [1] 0.8137182

The hopkins value obtained is 18.62%

  • X4 - Fourth group: T_TR_I, T_TR_E, A_T_G
## # A tibble: 28 x 3
##     T_TR_I  T_TR_E   A_T_G
##      <dbl>   <dbl>   <dbl>
##  1 238933   81503.  237701
##  2 533162. 243530. 1416428
##  3  39678.  20522.   29867
##  4  28558.  10079.   11934
##  5   6527.   6891.   32186
##  6 264223.  63495    90526
##  7 117378.  62014   242068
##  8  22248.   8404.   11475
##  9  84600.  46213.  196810
## 10 683107  377816  2407878
## # … with 18 more rows

[1] 0.8625978

The hopkins value obtained is 13.74%

  • X5 - Fifth group: I_TR_EX, E_TR_EX, T_TR_E
## # A tibble: 28 x 3
##    T_GDP_R  T_TR_E   A_T_G
##      <dbl>   <dbl>   <dbl>
##  1    83.1  81503.  237701
##  2   169.  243530. 1416428
##  3   107.   20522.   29867
##  4    74.9  10079.   11934
##  5    63.5   6891.   32186
##  6   158.   63495    90526
##  7    59.5  62014   242068
##  8   118.    8404.   11475
##  9    55.8  46213.  196810
## 10    45.1 377816  2407878
## # … with 18 more rows

[1] 0.8551166

The hopkins value obtained is 14.48%

To go further, I will select X1, as it has the lowest Hopkings value

## $hopkins_stat
## [1] 0.8494199
## 
## $plot

The hopkins value obtained is 84.94%

$hopkins_stat [1] 0.9536594

$plot

The hopkins value obtained is 95.36%

$hopkins_stat [1] 0.9905293

$plot

The hopkins value obtained is 99.05%

$hopkins_stat [1] 0.8645139

$plot

The hopkins value obtained is 86.45%

5) Identification of the numBER of clusters:

5.1) d index:

*** : The D index is a graphical method of determining the number of clusters. In the plot of D index, we seek a significant knee (the significant peak in Dindex second differences plot) that corresponds to a significant increase of the value of the measure.

$All.index 2 3 4 5 6 7 8 9 538448.59 344488.62 239732.59 184447.65 159562.98 117449.20 95298.51 76051.16 10 45686.66

As per the above graphic, it seems that the optimum is 4 clusters, the same according to the index.

5.2) Hubert:

*** : The Hubert index is a graphical method of determining the number of clusters. In the plot of Hubert index, we seek a significant knee that corresponds to a significant increase of the value of the measure i.e the significant peak in Hubert index second differences plot.

$All.index 2 3 4 5 6 7 8 9 10 0 0 0 0 0 0 0 0 0

As per the above graphic, also it seems that the optimum option is to choose 4 clusters.

5.3) K-means (silhouette and wss):

Firsty, silhoutte:

Finally, wss:

As a result of this operation it appears that the optimal number of clusters should be 2, also we can consider 3, as there is not a lot of difference between both values

5.4) PAM (silhouette and wss):

Firsty, silhoutte:

Finally, wss:

As per the above results, it looks that the optimal number of clusters should be 2, also we can consider 3 and 4, as there is not a lot of difference between these values.

Considering all the results above, I could say that the number of optimal clusters should be 3, because the result of two methods considered the optimal as 4 clusters, and another two methods considered the optimal as 2, hence I believe it would be better to take the average.

6) Clustering:

6.1) K-means:

cluster size ave.sil.width 1 1 3 0.32 2 2 21 0.82 3 3 4 0.36

The average of silhouette width for K-means is 0.7. Analyzing the position of the data points in the graph, it seems that they matched with their own cluster.

6.2) PAM:

cluster size ave.sil.width 1 1 20 0.86 2 2 5 0.21 3 3 3 0.37

The average of silhouette width for PAM is 0.69.

It appears that K-mean fits the data better, hence I will continue using K-means method.

7) Post-diagnosis:

    1         2         3 

0.5932179 0.2197336 0.5746008

As per the above values, I can conclude that the data is not clustered properly, as the first and the third value are similar.

For the reason above, I will attempt to perform the same analysis considering two clusters with K-means method:

    1         2 

0.1823198 0.5727513

As per the above values for two clusters, I can conclude that the data is clustered properly, the values of the clusters are not close.

Therefore, I will proceed once again to perform the analysis of the point 5.1), in this case with two clusters intead of three:

cluster size ave.sil.width 1 1 3 0.55 2 2 25 0.81

The average silhouette is 0.78. Greater than with three clusters.

8) Results:

There is a lot of distance in the second cluster, this means that the countries differ the most in the second group.

    1. Orange – Big economies:

The first group of countries has high GDP, with a low trade of GDP ratio, but big air transport of goods.

    1. Blue – Small economies:

The second group of countries has small GDP, in average it has greater trade GDP than the first group, but lowest air transports of goods than the first group.

9) Conclusions:

The method K-Means shows that the optimum division of the data would be in two clusters, considering the following variables: GDP, T_GDP_R and A_T_G.

The two groups have differing characteristics, divided between big countries with big GDP, and small countries with small GDP, with more distance in the first group of big economies. Regarding T_GDP_R the second group has a huge distance in comparison with the first. Finally, for the variable A_T_G, the second group has not a lot of distance, the data are concentrated around the same point.