Clustering applied to the EU trade data, 2018
1) Project description:
The goal of this paper is to analyze the evidence of clustering based on the selected data. To this end, I chose several data from Eurostat-database, in order to find out if there are clusters in term of trade in the European Union (28 countries).
2) Dataset description
Description of the diferent features:
GDP Gross domestic product at market prices(2018, million euro)I_TR_EX Intra-EU28 trade, exports(2018, millions of ECU/EURO)I_TR_IM Intra-EU28 trade, imports(2018, millions of ECU/EURO)E_TR_EX Extra-EU28 trade, exports(2018, millions of ECU/EURO)E_TR_IM Extra-EU28 trade, imports(2018, millions of ECU/EURO)T_TR_I Total intra-EU28 trade(2018, millions of ECU/EURO)T_TR_E Total extra-EU28 trade(2018, millions of ECU/EURO)T_TR_IM Total imports-EU28 trade(2018, millions of ECU/EURO)T_TR_EX Total exports-EU28 trade(2018, millions of ECU/EURO)T_GDP_R Total trade to GDP ratio(%)A_T_G Air transport of goods(2018, tonnes)T_TR_I= I_TR_EX + I_TR_IMT_TR_E= E_TR_EX + E_TR_IMT_TR_IM= I_TR_IM + E_TR_IMT_TR_EX= I_TR_EX + E_TR_EXT_GDP_R= ((T_TR_IM + T_TR_EX) / GDP) * 10
Data bibligraphy:
GDP Eurostat database (tipsau10) - https://ec.europa.eu/eurostat/web/products-datasets/-/tipsau10
I_TR_EX Eurostat database (tet00047) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00047
I_TR_IM Eurostat database (tet00047) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00047
E_TR_EX Eurostat database (tet00055) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00055
E_TR_IM Eurostat database (tet00055) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00055
A_T_G Eurostat database (ttr00011) – https://ec.europa.eu/eurostat/web/products-datasets/-/ttr00011
3) Manipulation of the data:
- First we will load all the necesary
packagesand thedata:
library(cluster)
library(factoextra)
library(flexclust)
library(fpc)
library(clustertend)
library(ClusterR)
library(NbClust)
library(readxl)
library(reshape)
library(psych)
library(dplyr)## # A tibble: 28 x 6
## C_EU GDP I_TR_EX I_TR_IM E_TR_EX E_TR_IM
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Austria 385712. 111673. 127260. 44756. 36748.
## 2 Belgium 459820. 287689. 245472. 107201 136329.
## 3 Bulgaria 56087. 19275. 20403. 8821. 11702.
## 4 Croatia 51579. 10001. 18557. 4749. 5330.
## 5 Cyprus 21138. 1250. 5277. 3001. 3890.
## 6 Czechia 207772. 144491. 119732. 26769. 36726.
## 7 Denmark 301341. 56594. 60784. 36013 26001
## 8 Estonia 26036. 9814. 12435. 4611. 3794.
## 9 Finland 234453 37884. 46716. 26352. 19861
## 10 France 2353090 290669. 392438. 201915. 175901.
## # … with 18 more rows
Finally, the dataset “trade” contains 28 rows, and 12 columns
4) Pre-diagnosis:
Firstly, I created two groups of data, but when I applied Hopkins the results were greater than 15%, this means that the groups selected were not highly clustered.
Consequently, I decided to create three additional groups, and I selected the one with the smallest Hopkins value, hence I selected “X1”, with the Hopkins value of: 0.1134274 (11,34%).
X1- First group: GDP, T_GDP_R, A_T_G
## # A tibble: 28 x 3
## GDP T_GDP_R A_T_G
## <dbl> <dbl> <dbl>
## 1 385712. 83.1 237701
## 2 459820. 169. 1416428
## 3 56087. 107. 29867
## 4 51579. 74.9 11934
## 5 21138. 63.5 32186
## 6 207772. 158. 90526
## 7 301341. 59.5 242068
## 8 26036. 118. 11475
## 9 234453 55.8 196810
## 10 2353090 45.1 2407878
## # … with 18 more rows
[1] 0.8865726
The hopkins value obtained is 11.34%
On the above graph it can be observed that there is one big cluster, and another additional cluster, also one outliner (11) – Germany.
X2- Second group: GDP, I_TR_EX, E_TR_EX
## # A tibble: 28 x 3
## GDP I_TR_EX E_TR_EX
## <dbl> <dbl> <dbl>
## 1 385712. 111673. 44756.
## 2 459820. 287689. 107201
## 3 56087. 19275. 8821.
## 4 51579. 10001. 4749.
## 5 21138. 1250. 3001.
## 6 207772. 144491. 26769.
## 7 301341. 56594. 36013
## 8 26036. 9814. 4611.
## 9 234453 37884. 26352.
## 10 2353090 290669. 201915.
## # … with 18 more rows
[1] 0.8222338
The hopkins value obtained is 17.77%
X3- Third group: GDP, I_TR_IM,E_TR_IM
## # A tibble: 28 x 3
## GDP I_TR_IM E_TR_IM
## <dbl> <dbl> <dbl>
## 1 385712. 127260. 36748.
## 2 459820. 245472. 136329.
## 3 56087. 20403. 11702.
## 4 51579. 18557. 5330.
## 5 21138. 5277. 3890.
## 6 207772. 119732. 36726.
## 7 301341. 60784. 26001
## 8 26036. 12435. 3794.
## 9 234453 46716. 19861
## 10 2353090 392438. 175901.
## # … with 18 more rows
## [1] 0.8137182
The hopkins value obtained is 18.62%
X4- Fourth group: T_TR_I, T_TR_E, A_T_G
## # A tibble: 28 x 3
## T_TR_I T_TR_E A_T_G
## <dbl> <dbl> <dbl>
## 1 238933 81503. 237701
## 2 533162. 243530. 1416428
## 3 39678. 20522. 29867
## 4 28558. 10079. 11934
## 5 6527. 6891. 32186
## 6 264223. 63495 90526
## 7 117378. 62014 242068
## 8 22248. 8404. 11475
## 9 84600. 46213. 196810
## 10 683107 377816 2407878
## # … with 18 more rows
[1] 0.8625978
The hopkins value obtained is 13.74%
X5- Fifth group: I_TR_EX, E_TR_EX, T_TR_E
## # A tibble: 28 x 3
## T_GDP_R T_TR_E A_T_G
## <dbl> <dbl> <dbl>
## 1 83.1 81503. 237701
## 2 169. 243530. 1416428
## 3 107. 20522. 29867
## 4 74.9 10079. 11934
## 5 63.5 6891. 32186
## 6 158. 63495 90526
## 7 59.5 62014 242068
## 8 118. 8404. 11475
## 9 55.8 46213. 196810
## 10 45.1 377816 2407878
## # … with 18 more rows
[1] 0.8551166
The hopkins value obtained is 14.48%
To go further, I will select X1, as it has the lowest Hopkings value
## $hopkins_stat
## [1] 0.8494199
##
## $plot
The hopkins value obtained is 84.94%
$hopkins_stat [1] 0.9536594
$plot
The hopkins value obtained is 95.36%
$hopkins_stat [1] 0.9905293
$plot
The hopkins value obtained is 99.05%
$hopkins_stat [1] 0.8645139
$plot
The hopkins value obtained is 86.45%
5) Identification of the numBER of clusters:
5.1) d index:
*** : The D index is a graphical method of determining the number of clusters. In the plot of D index, we seek a significant knee (the significant peak in Dindex second differences plot) that corresponds to a significant increase of the value of the measure.
$All.index 2 3 4 5 6 7 8 9 538448.59 344488.62 239732.59 184447.65 159562.98 117449.20 95298.51 76051.16 10 45686.66
As per the above graphic, it seems that the optimum is 4 clusters, the same according to the index.
5.2) Hubert:
*** : The Hubert index is a graphical method of determining the number of clusters. In the plot of Hubert index, we seek a significant knee that corresponds to a significant increase of the value of the measure i.e the significant peak in Hubert index second differences plot.
$All.index 2 3 4 5 6 7 8 9 10 0 0 0 0 0 0 0 0 0
As per the above graphic, also it seems that the optimum option is to choose 4 clusters.
5.3) K-means (silhouette and wss):
Firsty, silhoutte:
Finally, wss:
As a result of this operation it appears that the optimal number of clusters should be 2, also we can consider 3, as there is not a lot of difference between both values
5.4) PAM (silhouette and wss):
Firsty, silhoutte:
Finally, wss:
As per the above results, it looks that the optimal number of clusters should be 2, also we can consider 3 and 4, as there is not a lot of difference between these values.
Considering all the results above, I could say that the number of optimal clusters should be 3, because the result of two methods considered the optimal as 4 clusters, and another two methods considered the optimal as 2, hence I believe it would be better to take the average.
6) Clustering:
6.1) K-means:
cluster size ave.sil.width 1 1 3 0.32 2 2 21 0.82 3 3 4 0.36
The average of silhouette width for K-means is 0.7. Analyzing the position of the data points in the graph, it seems that they matched with their own cluster.
6.2) PAM:
cluster size ave.sil.width 1 1 20 0.86 2 2 5 0.21 3 3 3 0.37
The average of silhouette width for PAM is 0.69.
It appears that K-mean fits the data better, hence I will continue using K-means method.
7) Post-diagnosis:
1 2 3
0.5932179 0.2197336 0.5746008
As per the above values, I can conclude that the data is not clustered properly, as the first and the third value are similar.
For the reason above, I will attempt to perform the same analysis considering two clusters with K-means method:
1 2
0.1823198 0.5727513
As per the above values for two clusters, I can conclude that the data is clustered properly, the values of the clusters are not close.
Therefore, I will proceed once again to perform the analysis of the point 5.1), in this case with two clusters intead of three:
cluster size ave.sil.width 1 1 3 0.55 2 2 25 0.81
The average silhouette is 0.78. Greater than with three clusters.
8) Results:
There is a lot of distance in the second cluster, this means that the countries differ the most in the second group.
- Orange –
Big economies:
- Orange –
The first group of countries has high GDP, with a low trade of GDP ratio, but big air transport of goods.
- Blue –
Small economies:
- Blue –
The second group of countries has small GDP, in average it has greater trade GDP than the first group, but lowest air transports of goods than the first group.
9) Conclusions:
The method K-Means shows that the optimum division of the data would be in two clusters, considering the following variables: GDP, T_GDP_R and A_T_G.
The two groups have differing characteristics, divided between big countries with big GDP, and small countries with small GDP, with more distance in the first group of big economies. Regarding T_GDP_R the second group has a huge distance in comparison with the first. Finally, for the variable A_T_G, the second group has not a lot of distance, the data are concentrated around the same point.