Dimensionality Reduction applied to the EU trade data, 2018
1) Project description:
The goal of this paper is to apply dimensionality reduction techniques to a selected dataset. To this end, I chose several series from the Eurostat database in order to discover whether it is possible to reduce the number of features without losing quality, using the following dataset: trade in the European Union (28 countries) during 2018.
Last but not least, in the extensions section we will see how dimensionality reduction can be applied to image compression.
2) Dataset description
Description of the different features:
- GDP: Gross domestic product at market prices (2018, million euro)
- I_TR_EX: Intra-EU28 trade, exports (2018, millions of ECU/EURO)
- I_TR_IM: Intra-EU28 trade, imports (2018, millions of ECU/EURO)
- E_TR_EX: Extra-EU28 trade, exports (2018, millions of ECU/EURO)
- E_TR_IM: Extra-EU28 trade, imports (2018, millions of ECU/EURO)
- T_TR_I: Total intra-EU28 trade (2018, millions of ECU/EURO)
- T_TR_E: Total extra-EU28 trade (2018, millions of ECU/EURO)
- T_TR_IM: Total imports, EU28 trade (2018, millions of ECU/EURO)
- T_TR_EX: Total exports, EU28 trade (2018, millions of ECU/EURO)
- T_GDP_R: Total trade to GDP ratio (%)
- A_T_G: Air transport of goods (2018, tonnes)
Derived variables:
T_TR_I = I_TR_EX + I_TR_IM
T_TR_E = E_TR_EX + E_TR_IM
T_TR_IM = I_TR_IM + E_TR_IM
T_TR_EX = I_TR_EX + E_TR_EX
T_GDP_R = ((T_TR_IM + T_TR_EX) / GDP) * 100
Data bibliography:
GDP Eurostat database (tipsau10) - https://ec.europa.eu/eurostat/web/products-datasets/-/tipsau10
I_TR_EX Eurostat database (tet00047) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00047
I_TR_IM Eurostat database (tet00047) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00047
E_TR_EX Eurostat database (tet00055) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00055
E_TR_IM Eurostat database (tet00055) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00055
A_T_G Eurostat database (ttr00011) – https://ec.europa.eu/eurostat/web/products-datasets/-/ttr00011
3) Manipulation of the data:
- First we will load all the necessary packages and the data:
library(readxl)
library(corrplot)
library(ggplot2)
library(GGally)
library(smacof)
library(labdsv)
library(vegan)
library(MASS)
library(ape)
library(ggfortify)
library(FactoMineR)
library(factoextra)
library(pca3d)
library(pls)
library(ClusterR)
library(ggrepel)
library(MVN)
library(clusterSim)
library(dimRed)
library(fastICA)
library(umap)
library(ica)
# A tibble: 28 x 8
C_EU GDP I_TR_EX I_TR_IM E_TR_EX E_TR_IM A_T_G M_T_G
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Austria 385712. 111673. 127260. 44756. 36748. 237701 0
2 Belgium 459820. 287689. 245472. 107201 136329. 1416428 270317
3 Bulgaria 56087. 19275. 20403. 8821. 11702. 29867 27868
4 Croatia 51579. 10001. 18557. 4749. 5330. 11934 21573
5 Cyprus 21138. 1250. 5277. 3001. 3890. 32186 6948
6 Czechia 207772. 144491. 119732. 26769. 36726. 90526 0
7 Denmark 301341. 56594. 60784. 36013 26001 242068 95835
8 Estonia 26036. 9814. 12435. 4611. 3794. 11475 35947
9 Finland 234453 37884. 46716. 26352. 19861 196810 116764
10 France 2353090 290669. 392438. 201915. 175901. 2407878 308629
# … with 18 more rows
C_EU GDP I_TR_EX I_TR_IM
Length:28 Min. : 12324 Min. : 1250 Min. : 3867
Class :character 1st Qu.: 54960 1st Qu.: 17411 1st Qu.: 21085
Mode :character Median : 205834 Median : 62119 Median : 61327
Mean : 567906 Mean :125855 Mean :123237
3rd Qu.: 477487 3rd Qu.:181395 3rd Qu.:165952
Max. :3344370 Max. :778747 Max. :722546
E_TR_EX E_TR_IM A_T_G M_T_G
Min. : 1081 Min. : 1491 Min. : 11475 Min. : 0
1st Qu.: 8871 1st Qu.: 11184 1st Qu.: 28557 1st Qu.: 17917
Median : 22850 Median : 25792 Median : 157448 Median : 58458
Mean : 69923 Mean : 70741 Mean : 638555 Mean :145942
3rd Qu.: 76874 3rd Qu.: 85964 3rd Qu.: 828640 3rd Qu.:210472
Max. :541985 Max. :364885 Max. :4842716 Max. :604542
As we can see above, our data frame "trade" has 28 rows and 8 columns; we can also see the main summary statistics.
A new data frame, x.trade.df, was created in order to keep only the selected features, so we obtain a data frame with 28 rows and 7 columns.
The dataset and the selected features are the same as in the clustering project, but in this case M_T_G, the sea transport of goods, was also included.
I included this feature because I know it is not highly correlated with the rest of the features: the others mainly depend on how big the country is, whereas sea transport of goods also depends on the country's access to the sea. As we can see, some countries have zero values, in contrast to A_T_G (air transport of goods), since all the countries are well developed and have access to air transport.
At this point we are ready to begin with dimensionality reduction; later we may need to standardize or normalize the data frame, but that will be applied in the corresponding method.
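A minimal sketch of how the working data frame might have been built (the file name and the exact reading step are assumptions; only the column names appear in the output above):
library(readxl)
# Read the Eurostat extract ("trade_eu_2018.xlsx" is a placeholder name)
trade <- read_excel("trade_eu_2018.xlsx")
# Keep the seven numeric features; C_EU (the country name) is dropped
x.trade.df <- as.data.frame(trade[, c("GDP", "I_TR_EX", "I_TR_IM",
                                      "E_TR_EX", "E_TR_IM", "A_T_G", "M_T_G")])
rownames(x.trade.df) <- trade$C_EU
dim(x.trade.df)   # 28 rows, 7 columns
summary(trade)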
4) Correlation between variables:
Although the corrplot package was loaded at the beginning, it sometimes needs to be detached and loaded again, hence the code has to be run once more in order to produce a fancy correlation plot.
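A sketch of the correlation matrix and its plot, assuming the selected features are stored in x.trade.df as above:
library(corrplot)
trade.cor <- cor(x.trade.df)   # Pearson correlation matrix
round(trade.cor, 2)
corrplot(trade.cor, method = "circle", type = "upper", tl.col = "black")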
GDP I_TR_EX I_TR_IM E_TR_EX E_TR_IM A_T_G M_T_G
GDP 1.00 0.82 0.93 0.94 0.89 0.92 0.69
I_TR_EX 0.82 1.00 0.95 0.93 0.92 0.89 0.65
I_TR_IM 0.93 0.95 1.00 0.97 0.91 0.95 0.63
E_TR_EX 0.94 0.93 0.97 1.00 0.91 0.95 0.63
E_TR_IM 0.89 0.92 0.91 0.91 1.00 0.92 0.83
A_T_G 0.92 0.89 0.95 0.95 0.92 1.00 0.63
M_T_G 0.69 0.65 0.63 0.63 0.83 0.63 1.00
As we can see in the graph and in the table above, all the variables are highly correlated, but M_T_G (sea transport of goods) shows lower correlations with the other variables than the rest of the data. Hence the earlier decision to incorporate this variable makes sense, so that we do not have only highly correlated features.
Since all the variables are highly correlated, a priori it looks like some features can be removed without losing much information.
Additionally, it is possible to produce other kinds of graphs to check the correlation, such as the one below:
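For example, a pairs plot with GGally could be produced as in this sketch (assuming x.trade.df as before):
library(GGally)
ggpairs(x.trade.df)   # scatter plots, densities and correlations in one figure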
5) MDS - Multidimensional Scaling:
I will apply the dimensionality reduction method MDS.
First of all, one should standardize the variables:
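A sketch of the standardization step, assuming the selected features are in x.trade.df (the object name x.trade.std is illustrative):
# Centre every variable to mean 0 and scale to unit standard deviation
x.trade.std <- scale(x.trade.df)
head(x.trade.std)
summary(x.trade.std)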
GDP I_TR_EX I_TR_IM E_TR_EX E_TR_IM A_T_G
[1,] -0.2117651 -0.08321239 0.02573900 -0.2214649 -0.3418405 -0.3587276
[2,] -0.1256292 0.94957815 0.78205075 0.3280349 0.6595468 0.6961247
[3,] -0.5948892 -0.62536579 -0.65792005 -0.5376829 -0.5937038 -0.5447199
[4,] -0.6001286 -0.67978063 -0.66973058 -0.5735100 -0.6577816 -0.5607683
[5,] -0.6355106 -0.73112789 -0.75469669 -0.5888910 -0.6722612 -0.5426446
[6,] -0.4185847 0.10935070 -0.02242635 -0.3797415 -0.3420598 -0.4904357
M_T_G
[1,] -0.7978678
[2,] 0.6799595
[3,] -0.6455130
[4,] -0.6799278
[5,] -0.7598829
[6,] -0.7978678
GDP I_TR_EX I_TR_IM E_TR_EX
Min. :-0.6458 Min. :-0.7311 Min. :-0.7637 Min. :-0.60579
1st Qu.:-0.5962 1st Qu.:-0.6363 1st Qu.:-0.6536 1st Qu.:-0.53724
Median :-0.4208 Median :-0.3740 Median :-0.3961 Median :-0.41422
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
3rd Qu.:-0.1051 3rd Qu.: 0.3259 3rd Qu.: 0.2733 3rd Qu.: 0.06116
Max. : 3.2271 Max. : 3.8309 Max. : 3.8343 Max. : 4.15400
E_TR_IM A_T_G M_T_G
Min. :-0.6964 Min. :-0.5612 Min. :-0.7979
1st Qu.:-0.5989 1st Qu.:-0.5459 1st Qu.:-0.6999
Median :-0.4520 Median :-0.4305 Median :-0.4783
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.1531 3rd Qu.: 0.1701 3rd Qu.: 0.3528
Max. : 2.9579 Max. : 3.7623 Max. : 2.5072
As we can see above, the data were standardized so that every variable has mean 0; the maximum value is 4.154 and the minimum is -0.7979.
We compute the distance matrix of our standardized variables:
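The distance matrix could be obtained with a sketch like this (assuming x.trade.std is the standardized data):
# Euclidean distances between countries, based on the standardized features
trade.dist <- dist(x.trade.std, method = "euclidean")
round(as.matrix(trade.dist)[1:2, ], 2)   # first two rows, as printed below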
1 2 3 4 5 6 7
1 0.0000000 2.499499 1.06270323 1.12644689 1.2197880 0.3529495 0.7662019
2 2.4994992 0.000000 3.22348353 3.31131988 3.4159194 2.5563745 2.6280262
8 9 10 11 12 13 14
1 1.17723988 0.9735547 4.438445 9.554553 1.3917886 0.5348002 0.6330087
2 3.30898281 2.7582468 2.722305 7.628977 2.8304836 2.8438122 2.6342803
15 16 17 18 19 20 21
1 3.990972 1.2078687 1.0989429 1.2580809 1.24185007 5.079678 0.7499864
2 2.209081 3.2504548 3.1837662 3.1665807 3.44822826 2.726870 1.9314882
22 23 24 25 26 27 28
1 0.8716965 0.7323349 0.7218674 1.02444657 3.295780 1.026152 5.184801
2 2.8063779 2.8892734 3.0213397 3.20528380 1.816859 2.207987 3.336532
[ reached getOption("max.print") -- omitted 4 rows ]
Analyzing the table above, we find that country 11 is the most distinct one; we will check which country has index 11:
# A tibble: 1 x 1
C_EU
<chr>
1 Germany
Germany is the country number 11.
We are ready to perform the multidimensional scaling of our distance matrix with the function cmdscale:
We will plot the previous results using ggplot2. Accordingly, we obtain a two-dimensional multidimensional scaling solution:
With this graph we can confirm that Germany (11) is the most distinct country.
Subsequently we will check the eigenvalues and compute the goodness of fit of the solution:
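A sketch of the classical MDS step and the two-dimensional plot (the object names mds.fit and mds.df are illustrative):
library(ggplot2)
library(ggrepel)
mds.fit <- cmdscale(trade.dist, k = 2, eig = TRUE)   # classical (metric) MDS
mds.df  <- data.frame(Dim1 = mds.fit$points[, 1],
                      Dim2 = mds.fit$points[, 2],
                      country = trade$C_EU)
ggplot(mds.df, aes(Dim1, Dim2, label = country)) +
  geom_point(colour = "steelblue") +
  geom_text_repel(size = 3)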
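A sketch of how the eigenvalues and the goodness of fit can be extracted from the cmdscale object defined above:
round(mds.fit$eig, 4)   # eigenvalues for all dimensions
mds.fit$GOF             # goodness of fit of the two-dimensional solution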
[1] 165.7582 14.6403 5.2051 2.0171 0.7959 0.5107 0.0727 0.0000
[9] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
[17] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
[25] 0.0000 0.0000 0.0000 0.0000
[1] 0.9544893 0.9544893
We obtain 0.9544 as the goodness of fit. This is a high value, hence we can conclude that the data are fitted really well.
We will continue with the fitted distances:
In the next plot we can check the fitted distances against the observed distances; we also plot a red regression line:
As we can observe in the graph above, the red regression line relates the fitted and observed distances almost perfectly linearly, which shows that there is a strong correlation between the fitted and observed distances, as intended. Apart from that, the goodness of fit is very good, as the regression line is close to 45 degrees.
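A sketch of the fitted-versus-observed plot and the coefficient of determination, using the objects defined above:
# Distances reproduced by the two-dimensional configuration
fitted.dist   <- as.vector(dist(mds.fit$points))
observed.dist <- as.vector(trade.dist)
plot(observed.dist, fitted.dist,
     xlab = "Observed distances", ylab = "Fitted distances")
abline(lm(fitted.dist ~ observed.dist), col = "red")
# Coefficient of determination between fitted and observed distances
summary(lm(fitted.dist ~ observed.dist))$r.squared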
[1] "Coefficient of determination: 0.995639837782487"
The coefficient of determination also confirms that we have fitted the data very well, with a high value of 0.9956.
6) Non-metric MDS:
In this case, contrary to the previous technique (metric MDS), the method finds a non-parametric monotonic relationship between the dissimilarities in the item matrix and the Euclidean distances between items, and hence the location of each item in the low-dimensional space.
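A sketch of the non-metric MDS fit, using isoMDS from MASS and metaMDS from vegan (both packages are loaded above; the object names and the iteration limits are assumptions consistent with the output below):
nmds.iso  <- isoMDS(trade.dist, k = 2, maxit = 100)   # Kruskal's non-metric MDS
nmds.meta <- metaMDS(trade.dist, k = 2, trymax = 20)  # repeated random starts
nmds.meta$stress                                       # final stress value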
initial value 41.066481
iter 5 value 38.587966
iter 10 value 34.783267
iter 15 value 29.682143
iter 20 value 26.171888
iter 25 value 23.048600
iter 30 value 20.836697
iter 35 value 19.520166
iter 40 value 18.740493
iter 45 value 18.091546
iter 50 value 17.573080
iter 55 value 17.134058
iter 60 value 16.718544
iter 65 value 16.223233
iter 70 value 15.590125
iter 75 value 14.475025
iter 80 value 12.664584
iter 85 value 9.133363
iter 90 value 7.042891
iter 95 value 5.935541
iter 100 value 5.064410
final value 5.064410
stopped after 100 iterations
Above we have the trace of the iterations; next we obtain the plot of the two-dimensional solution.
Run 0 stress 0.01738228
Run 1 stress 0.02665436
Run 2 stress 0.02568885
Run 3 stress 0.01848068
Run 4 stress 0.02245584
Run 5 stress 0.02078068
Run 6 stress 0.01832534
Run 7 stress 0.0215735
Run 8 stress 0.02260607
Run 9 stress 0.01827988
Run 10 stress 0.01757911
... Procrustes: rmse 0.008121119 max resid 0.03040519
Run 11 stress 0.02458097
Run 12 stress 0.01837971
Run 13 stress 0.02457995
Run 14 stress 0.03100273
Run 15 stress 0.02121632
Run 16 stress 0.01738548
... Procrustes: rmse 0.005244945 max resid 0.01942814
Run 17 stress 0.02076053
Run 18 stress 0.02072885
Run 19 stress 0.01756905
... Procrustes: rmse 0.008072044 max resid 0.03022238
Run 20 stress 0.01827637
*** No convergence -- monoMDS stopping criteria:
15: no. of iterations >= maxit
5: stress ratio > sratmax
Above we have the values of the stress (a measure of goodness of fit); we should minimize it in order to obtain the MDS solution.
Now we will graphically check the stress:
We obtain a coefficient of determination of 0.9967; we can also see in the graph above that in almost all cases the fitted distances match the observed distances almost perfectly.
Now we will compute the stress for each number of dimensions and determine how many dimensions are required in order to obtain a good fit:
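A sketch of the stress computation for an increasing number of dimensions (the loop bounds and object names are assumptions; the output below corresponds to one metaMDS run per dimensionality):
max.dim <- 10
stress.values <- numeric(max.dim)
for (k in 1:max.dim) {
  fit <- metaMDS(trade.dist, k = k, trymax = 20)
  stress.values[k] <- fit$stress
}
plot(1:max.dim, stress.values, type = "b",
     xlab = "Number of dimensions", ylab = "Stress")
abline(h = 0.02, col = "red", lty = 2)   # 2% stress threshold used later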
Run 0 stress 0.05281717
Run 1 stress 0.2217868
Run 2 stress 0.1732915
Run 3 stress 0.05282769
... Procrustes: rmse 0.00106163 max resid 0.004544145
... Similar to previous best
Run 4 stress 0.05282506
... Procrustes: rmse 0.001037201 max resid 0.004515597
... Similar to previous best
Run 5 stress 0.173177
Run 6 stress 0.05307755
... Procrustes: rmse 0.002012024 max resid 0.008901526
... Similar to previous best
Run 7 stress 0.5560847
Run 8 stress 0.5321585
Run 9 stress 0.1863158
Run 10 stress 0.2336482
Run 11 stress 0.2234388
Run 12 stress 0.1504285
Run 13 stress 0.05282555
... Procrustes: rmse 0.0004841585 max resid 0.001815399
... Similar to previous best
Run 14 stress 0.2361418
Run 15 stress 0.5562634
Run 16 stress 0.2289602
Run 17 stress 0.1731504
Run 18 stress 0.552298
Run 19 stress 0.185972
Run 20 stress 0.1736204
*** Solution reached
Run 0 stress 0.01738228
Run 1 stress 0.03100088
Run 2 stress 0.02111841
Run 3 stress 0.02150661
Run 4 stress 0.01849076
Run 5 stress 0.02246308
Run 6 stress 0.02420226
Run 7 stress 0.01781484
... Procrustes: rmse 0.01772445 max resid 0.06540565
Run 8 stress 0.01766881
... Procrustes: rmse 0.01021305 max resid 0.03832584
Run 9 stress 0.3894621
Run 10 stress 0.03133328
Run 11 stress 0.02113734
Run 12 stress 0.01827264
Run 13 stress 0.01826271
Run 14 stress 0.01996776
Run 15 stress 0.01763289
... Procrustes: rmse 0.01498576 max resid 0.05528703
Run 16 stress 0.02457944
Run 17 stress 0.03133148
Run 18 stress 0.01891729
Run 19 stress 0.01741122
... Procrustes: rmse 0.002271449 max resid 0.00845779
... Similar to previous best
Run 20 stress 0.01740315
... Procrustes: rmse 0.001710116 max resid 0.006362149
... Similar to previous best
*** Solution reached
Run 0 stress 0.004922049
Run 1 stress 0.005034798
... Procrustes: rmse 0.01212265 max resid 0.03559764
Run 2 stress 0.005159033
... Procrustes: rmse 0.01226255 max resid 0.0440695
Run 3 stress 0.006996777
Run 4 stress 0.01075896
Run 5 stress 0.007540382
Run 6 stress 0.006991977
Run 7 stress 0.007230715
Run 8 stress 0.00777392
Run 9 stress 0.008956499
Run 10 stress 0.009431452
Run 11 stress 0.006409408
Run 12 stress 0.006398254
Run 13 stress 0.01092658
Run 14 stress 0.005068423
... Procrustes: rmse 0.009942255 max resid 0.03492989
Run 15 stress 0.01536551
Run 16 stress 0.008970558
Run 17 stress 0.008161329
Run 18 stress 0.01189896
Run 19 stress 0.006747112
Run 20 stress 0.007377814
*** No convergence -- monoMDS stopping criteria:
20: no. of iterations >= maxit
Run 0 stress 0.0005684983
Run 1 stress 0.005124519
Run 2 stress 0.00313353
Run 3 stress 0.003063887
Run 4 stress 0.001130098
Run 5 stress 0.002415334
Run 6 stress 0.003542343
Run 7 stress 0.008225808
Run 8 stress 0.003611772
Run 9 stress 0.004926459
Run 10 stress 0.002758216
Run 11 stress 0.0009998899
... Procrustes: rmse 0.01340251 max resid 0.03291724
Run 12 stress 0.00450292
Run 13 stress 0.004131073
Run 14 stress 0.004653221
Run 15 stress 0.004779995
Run 16 stress 0.003550253
Run 17 stress 0.005096152
Run 18 stress 0.004022344
Run 19 stress 0.00287027
Run 20 stress 0.001638391
*** No convergence -- monoMDS stopping criteria:
20: no. of iterations >= maxit
Run 0 stress 9.951893e-05
Run 1 stress 0.0008508154
Run 2 stress 0.002194416
Run 3 stress 0.004169654
Run 4 stress 0.002355673
Run 5 stress 0.002029783
Run 6 stress 0.002243723
Run 7 stress 0.002041155
Run 8 stress 0.00221081
Run 9 stress 0.001388282
Run 10 stress 0.001669469
Run 11 stress 0.002571836
Run 12 stress 0.0012516
Run 13 stress 0.0007910094
Run 14 stress 0.001337923
Run 15 stress 0.001331504
Run 16 stress 0.003299361
Run 17 stress 0.00172367
Run 18 stress 0.002965277
Run 19 stress 0.003053235
Run 20 stress 0.002452937
*** No convergence -- monoMDS stopping criteria:
20: no. of iterations >= maxit
Run 0 stress 6.497191e-05
Run 1 stress 0.000996892
Run 2 stress 0.002420061
Run 3 stress 0.003097316
Run 4 stress 0.002399856
Run 5 stress 0.002591253
Run 6 stress 0.001428916
Run 7 stress 0.001944264
Run 8 stress 0.0006884085
Run 9 stress 0.001526541
Run 10 stress 0.00216325
Run 11 stress 0.00191342
Run 12 stress 0.001818719
Run 13 stress 0.002417233
Run 14 stress 0.001162766
Run 15 stress 0.002723075
Run 16 stress 0.002553233
Run 17 stress 0.001472004
Run 18 stress 0.002449874
Run 19 stress 0.001763495
Run 20 stress 0.001525711
*** No convergence -- monoMDS stopping criteria:
20: no. of iterations >= maxit
Run 0 stress 0
Run 1 stress 0.001370632
Run 2 stress 0.0009133847
Run 3 stress 0.0007920234
Run 4 stress 0.0009817165
Run 5 stress 0.001604746
Run 6 stress 0.001329391
Run 7 stress 0.001881362
Run 8 stress 0.001098428
Run 9 stress 0.00199932
Run 10 stress 0.002371378
Run 11 stress 0.001692948
Run 12 stress 0.002828057
Run 13 stress 0.00116458
Run 14 stress 0.003370393
Run 15 stress 0.003534193
Run 16 stress 0.002540077
Run 17 stress 0.00100504
Run 18 stress 0.001356418
Run 19 stress 0.000723162
Run 20 stress 0.002597001
*** No convergence -- monoMDS stopping criteria:
20: no. of iterations >= maxit
Run 0 stress 0
Run 1 stress 0.002580288
Run 2 stress 0.0009090862
Run 3 stress 0.00157642
Run 4 stress 0.001814163
Run 5 stress 0.001873632
Run 6 stress 0.001450652
Run 7 stress 0.002143105
Run 8 stress 0.002803625
Run 9 stress 0.0008657835
Run 10 stress 0.003171441
Run 11 stress 0.0007095645
Run 12 stress 0.002457728
Run 13 stress 0.003629506
Run 14 stress 0.002011247
Run 15 stress 0.00139924
Run 16 stress 0.000921801
Run 17 stress 0.002123891
Run 18 stress 0.002381936
Run 19 stress 0.003486536
Run 20 stress 0.003189123
*** No convergence -- monoMDS stopping criteria:
20: no. of iterations >= maxit
Run 0 stress 0
Run 1 stress 0.002310324
Run 2 stress 0.001242967
Run 3 stress 0.0008261568
Run 4 stress 0.001427361
Run 5 stress 0.002581567
Run 6 stress 0.001848706
Run 7 stress 0.001797331
Run 8 stress 0.0007404808
Run 9 stress 0.002163955
Run 10 stress 0.001756932
Run 11 stress 0.001823287
Run 12 stress 0.001849845
Run 13 stress 0.001394392
Run 14 stress 0.00248491
Run 15 stress 0.002130265
Run 16 stress 0.001155515
Run 17 stress 0.001404284
Run 18 stress 0.002026062
Run 19 stress 0.001522932
Run 20 stress 0.001226192
*** No convergence -- monoMDS stopping criteria:
20: no. of iterations >= maxit
Run 0 stress 0
Run 1 stress 0.002686181
Run 2 stress 0.001089588
Run 3 stress 0.002984528
Run 4 stress 0.002780958
Run 5 stress 0.003116096
Run 6 stress 0.003322968
Run 7 stress 0.004951522
Run 8 stress 0.001665136
Run 9 stress 0.002437215
Run 10 stress 0.001540673
Run 11 stress 0.001362519
Run 12 stress 0.001574775
Run 13 stress 0.002798844
Run 14 stress 0.001203459
Run 15 stress 0.001295739
Run 16 stress 0.00204832
Run 17 stress 0.001385791
Run 18 stress 0.001410418
Run 19 stress 0.001655983
Run 20 stress 0.001658941
*** No convergence -- monoMDS stopping criteria:
20: no. of iterations >= maxit
Graph of the above stress analysis:
As we can see above, if we consider 2% as the maximum acceptable stress (a low threshold, chosen because the data are highly correlated), then two dimensions are enough, yielding slightly less than 2% stress.
7) PCA - Principal Component Analysis:
Now we will apply the linear dimensionality reduction method PCA, which finds new uncorrelated components that maximize the retained variance.
Firstly, we standardize the data, as this is the first step before applying PCA:
GDP I_TR_EX I_TR_IM E_TR_EX
Min. :-0.6458 Min. :-0.7311 Min. :-0.7637 Min. :-0.60579
1st Qu.:-0.5962 1st Qu.:-0.6363 1st Qu.:-0.6536 1st Qu.:-0.53724
Median :-0.4208 Median :-0.3740 Median :-0.3961 Median :-0.41422
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
3rd Qu.:-0.1051 3rd Qu.: 0.3259 3rd Qu.: 0.2733 3rd Qu.: 0.06116
Max. : 3.2271 Max. : 3.8309 Max. : 3.8343 Max. : 4.15400
E_TR_IM A_T_G M_T_G
Min. :-0.6964 Min. :-0.5612 Min. :-0.7979
1st Qu.:-0.5989 1st Qu.:-0.5459 1st Qu.:-0.6999
Median :-0.4520 Median :-0.4305 Median :-0.4783
Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
3rd Qu.: 0.1531 3rd Qu.: 0.1701 3rd Qu.: 0.3528
Max. : 2.9579 Max. : 3.7623 Max. : 2.5072
Once the data have been standardized, we can apply PCA:
Above we can find the eigenvalues for each dimension. As we can see, the first component alone retains almost 88% of the information, because there is high correlation between the features.
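A sketch of the PCA call with FactoMineR and the scree plot with factoextra (assuming x.trade.df holds the raw selected features, standardized internally via scale.unit = TRUE; the object name trade.pca is illustrative):
library(FactoMineR)
library(factoextra)
trade.pca <- PCA(x.trade.df, scale.unit = TRUE, ncp = 7, graph = FALSE)
get_eigenvalue(trade.pca)               # eigenvalues and explained variance
fviz_eig(trade.pca, addlabels = TRUE)   # scree plot of the dimensions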
Column plot of the eigenvalue for each dimension:
Column plot of the eigenvalue for each variable:
Plot of the variables for each of the dimensions, with the vectors:
Representation of the importance of each variable:
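The variable plots above might be produced with factoextra as in this sketch (one possible choice of functions, not necessarily the exact calls used here):
# Variables on the first two principal components, coloured by contribution
fviz_pca_var(trade.pca, col.var = "contrib",
             gradient.cols = c("blue", "orange", "red"), repel = TRUE)
# Contribution of each variable to the first component
fviz_contrib(trade.pca, choice = "var", axes = 1)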
As we can see above, the least important variable is M_T_G (sea transport of goods), because it is less correlated with the others, as we saw in the previous steps.
We have finished with the graphical results of PCA; now we can look at the numerical results:
eigenvalue variance.percent cumulative.variance.percent
Dim.1 6.13919303 87.70275750 87.70276
Dim.2 0.54223180 7.74616851 95.44893
Dim.3 0.19278247 2.75403532 98.20296
Dim.4 0.07470699 1.06724273 99.27020
Dim.5 0.02947691 0.42109872 99.69130
Dim.6 0.01891505 0.27021507 99.96152
Dim.7 0.00269375 0.03848215 100.00000
Above we have the table with the eigenvalue, the percentage of variance and the cumulative variance for each dimension.
We can also obtain the results for the variables used in this analysis, starting with the coordinates of the variables:
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
GDP 0.9474782 -0.04856201 0.30064474 -0.08518174 0.0247928788 -0.03348593
I_TR_EX 0.9455461 -0.10920569 -0.29845573 -0.06257200 0.0085093814 0.01098362
I_TR_IM 0.9745411 -0.17794242 -0.02245772 -0.09145207 0.0908932163 0.02214732
E_TR_EX 0.9717463 -0.18031725 0.03124067 -0.04247943 -0.1412986982 0.01652801
E_TR_IM 0.9744126 0.15329370 -0.08532839 0.08934757 -0.0007241531 -0.10790854
A_T_G 0.9618374 -0.14905343 0.06672130 0.21139098 0.0230421281 0.05428960
M_T_G 0.7607814 0.64657091 0.01031479 -0.02643277 -0.0056081276 0.04814369
Dim.7
GDP -0.023402168
I_TR_EX -0.028854475
I_TR_IM 0.031405006
E_TR_EX 0.013256981
E_TR_IM 0.010605346
A_T_G -0.005975913
M_T_G 0.001816897
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
GDP 14.62269 0.434919 46.88562037 9.7125156 2.085316364 5.9281235
I_TR_EX 14.56311 2.199407 46.20535459 5.2408155 0.245648441 0.6377984
I_TR_IM 15.46995 5.839478 0.26161560 11.1950456 28.027281844 2.5931925
E_TR_EX 15.38135 5.996386 0.50625938 2.4154388 67.732071661 1.4442209
E_TR_IM 15.46587 4.333748 3.77676143 10.6857301 0.001779012 61.5607625
A_T_G 15.06926 4.097311 2.30919956 59.8152120 1.801205283 15.5820903
M_T_G 9.42776 77.098752 0.05518908 0.9352424 0.106697395 12.2538121
Dim.7
GDP 20.3308197
I_TR_EX 30.9078667
I_TR_IM 36.6134329
E_TR_EX 6.5242702
E_TR_IM 4.1753452
A_T_G 1.3257180
M_T_G 0.1225472
Above we can find the contribution of each variable to each principal component.
Moreover, the quality of representation of the variables:
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
GDP 0.8977149 0.002358269 0.0903872582 0.0072559282 6.146868e-04
I_TR_EX 0.8940574 0.011925883 0.0890758251 0.0039152555 7.240957e-05
I_TR_IM 0.9497303 0.031663507 0.0005043490 0.0083634817 8.261577e-03
E_TR_EX 0.9442910 0.032514311 0.0009759793 0.0018045016 1.996532e-02
E_TR_IM 0.9494799 0.023498957 0.0072809341 0.0079829875 5.243977e-07
A_T_G 0.9251312 0.022216924 0.0044517320 0.0446861451 5.309397e-04
M_T_G 0.5787884 0.418053945 0.0001063949 0.0006986914 3.145110e-05
Dim.6 Dim.7
GDP 0.0011213078 5.476615e-04
I_TR_EX 0.0001206399 8.325807e-04
I_TR_IM 0.0004905038 9.862744e-04
E_TR_EX 0.0002731752 1.757475e-04
E_TR_IM 0.0116442520 1.124734e-04
A_T_G 0.0029473609 3.571153e-05
M_T_G 0.0023178153 3.301116e-06
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
1 -0.7081383 -0.62480743 -0.112336274 -0.201171787 0.14054581 -0.130246854
2 1.4889686 0.19158694 -0.754572000 0.197345371 0.24140882 0.267109310
3 -1.5789218 -0.15750459 0.030584106 0.079321199 -0.07221526 -0.050829273
4 -1.6623837 -0.17777122 0.071245252 0.072900896 -0.05319533 -0.028212932
5 -1.7581685 -0.22041251 0.088862350 0.143517731 -0.08806775 -0.048703436
6 -0.8459732 -0.56271422 -0.413615781 -0.242020377 0.20731004 -0.143108359
7 -0.9465338 -0.01204251 0.090380132 -0.025337131 -0.06761222 0.059823996
8 -1.6721466 -0.10001795 0.058367283 0.082734696 -0.07982926 0.011938624
9 -1.0917732 0.14774514 0.118859931 0.004779335 -0.07087744 0.117784488
10 3.5990562 -0.30028081 0.813951178 -0.140561544 0.48246822 0.095626890
Dim.7
1 0.0442750157
2 0.1641505775
3 -0.0007283838
4 0.0031378994
5 -0.0155473461
6 -0.0239781655
7 -0.0130275743
8 -0.0072342891
9 0.0029381720
10 -0.0694704145
[ reached getOption("max.print") -- omitted 18 rows ]
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
1 0.29172042 2.571280e+00 2.337838e-01 1.934705528 2.39329278 3.203087864
2 1.28973874 2.417624e-01 1.054814e+01 1.861806795 7.06100089 13.471390160
3 1.45028019 1.633967e-01 1.732870e-02 0.300787141 0.63185584 0.487822872
4 1.60765644 2.081517e-01 9.403430e-02 0.254065943 0.34285222 0.150290247
5 1.79825665 3.199849e-01 1.462884e-01 0.984672573 0.93970999 0.447871639
6 0.41633593 2.085608e+00 3.169338e+00 2.800171236 5.20715697 3.866912727
7 0.52119824 9.551931e-04 1.513284e-01 0.030689909 0.55387321 0.675748677
8 1.62659482 6.588899e-02 6.311221e-02 0.327232233 0.77211898 0.026911808
9 0.69341867 1.437750e-01 2.617251e-01 0.001091983 0.60866175 2.619452716
10 7.53542829 5.938984e-01 1.227358e+01 0.944525667 28.20313818 1.726610699
Dim.7
1 2.598975e+00
2 3.572477e+01
3 7.034047e-04
4 1.305457e-02
5 3.204772e-01
6 7.622836e-01
7 2.250152e-01
8 6.938672e-02
9 1.144561e-02
10 6.398592e+00
[ reached getOption("max.print") -- omitted 18 rows ]
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
1 0.50981502 0.3968888091 1.282970e-02 4.114439e-02 2.008225e-02 1.724690e-02
2 0.73444868 0.0121596818 1.886217e-01 1.290162e-02 1.930619e-02 2.363570e-02
3 0.98427287 0.0097944646 3.693056e-04 2.484121e-03 2.058981e-03 1.020051e-03
4 0.98375714 0.0112498888 1.806913e-03 1.891869e-03 1.007331e-03 2.833491e-04
5 0.97248941 0.0152839763 2.484274e-03 6.480000e-03 2.440044e-03 7.462478e-04
6 0.53972055 0.2387984648 1.290179e-01 4.417328e-02 3.241130e-02 1.544492e-02
7 0.98108239 0.0001588062 8.944976e-03 7.029883e-04 5.005916e-03 3.919079e-03
8 0.99049808 0.0035437314 1.206824e-03 2.424824e-03 2.257507e-03 5.049094e-05
9 0.95598152 0.0175069720 1.133067e-02 1.831975e-05 4.029031e-03 1.112656e-02
10 0.92705697 0.0064533367 4.741610e-02 1.414042e-03 1.665968e-02 6.544692e-04
Dim.7
1 1.992939e-03
2 8.926377e-03
3 2.094666e-07
4 3.505124e-06
5 7.604601e-05
6 4.335984e-04
7 1.858490e-04
8 1.853944e-05
9 6.923713e-06
10 3.454053e-04
[ reached getOption("max.print") -- omitted 18 rows ]
8) Conclusions:
As a result of the above analysis, we can conclude that it is possible to reduce the dimensions of the selected dataset without losing much information.
The first step of the analysis allowed us to conclude that all the variables were highly correlated, probably because all of them depend on how big the country is. This means that bigger economies have a larger GDP, and the same holds for the rest of the variables. The European Union of 28 has free intra-Community trade, which means there are no internal barriers, and it also has a common trade policy towards external countries for extra-Community trade, via tariffs or free trade agreements (for example CETA, between the EU-28 and Canada). Consequently, as per the previous arguments, big economies have a larger GDP and larger volumes of intra-Community and extra-Community trade, but also more air transport of goods.
On the other hand, there is one variable that is less correlated, sea transport of goods, since it depends not only on how big the economy is but also on geography; there are countries with zero values because they have no access to sea ports.
After applying dimensionality reduction techniques, we found the best goodness of fit with two dimensions, which means we reduced five features. Moreover, we checked that one component alone could retain almost 88% of the information, but one of the main reasons for applying dimensionality reduction, apart from data compression, is visualization, and we need at least two dimensions in order to make a graphical representation. Additionally, since the variables were highly correlated, we set a low acceptable level of stress, less than 2%, and we achieved it with two dimensions.
9) Extensions - Image compression:
One of the applications of dimensionality reduction is image compression, when we want to reduce the size of an image in terms of memory without losing its quality and shape.
First we should install the jpeg package in order to manipulate .jpeg files:
Original photo
We can see the original photo below (1.243.080 bytes - 1,2 MB).
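A sketch of how the image can be read into R (the file name is a placeholder):
library(jpeg)
rome <- readJPEG("rome.jpg")   # array: height x width x 3 (RGB)
dim(rome)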
[1] 1836 3264 3
As we can see above, the Rome image is now represented as three matrices in an array of 1836 x 3264 x 3. Each matrix corresponds to one of the RGB (red, green and blue) color channels.
Now we will extract the individual color value matrices to perform PCA on each one:
We performed PCA on each color value matrix:
Now we collect the PCA objects into the list rgb.pca:
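A sketch of the per-channel PCA, assuming the array rome read above (object names are illustrative):
# Split the array into the three color channel matrices
r <- rome[, , 1]
g <- rome[, , 2]
b <- rome[, , 3]
# PCA on each channel; pixel values are already in [0, 1], so no centering
r.pca <- prcomp(r, center = FALSE)
g.pca <- prcomp(g, center = FALSE)
b.pca <- prcomp(b, center = FALSE)
rgb.pca <- list(r.pca, g.pca, b.pca)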
At this point we are ready to compress the image. Once the principal components have been found for each color value matrix, we have new dimensions that describe the original pixels (data).
We apply a for loop in order to reconstruct the image from the projections of the data, creating several files with different numbers of components:
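A sketch of the reconstruction loop with writeJPEG; the numbers of components follow the files listed below, and the output file names are illustrative:
for (i in c(3, 43, 83, 523)) {
  # Rebuild each channel from the first i principal components
  pca.img <- sapply(rgb.pca, function(j) {
    j$x[, 1:i] %*% t(j$rotation[, 1:i])
  }, simplify = "array")
  pca.img <- pmax(pmin(pca.img, 1), 0)   # keep pixel values in [0, 1]
  writeJPEG(pca.img, paste0("rome_", i, "_components.jpg"))
}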
3 components(182.824 bytes - 188 KB):
3 components
43 components(332.384 bytes - 336 KB):
43 components
83 components(375.131 bytes - 377 KB):
83 components
523 components(409.873 bytes - 414 KB):
523 components
10) Summary and conclusions - Image compression:
- Original photo, 1836 components (1.243.080 bytes - 1,2 MB)
- 3 components (182.824 bytes - 188 KB)
- 43 components (332.384 bytes - 336 KB)
- 83 components (375.131 bytes - 377 KB)
- 523 components (409.873 bytes - 414 KB)
As we can see in the different photos, we reduced the number of components and also the memory used. The original photo was taken with my smartphone during a trip to Rome, and with the full number of components its quality is really good; I cannot see much difference between the original and the version with 523 components. If we zoomed into the photo with 523 components we would certainly notice the difference, but without zooming it looks more or less the same. The version with 3 components is not really useful, as we lose a lot of information; for example, we are no longer able to see the shape of the sculpture. In the next one, with 43 components, we can see the shape, but the quality is not really good and we can even see a lot of pixels. Finally, in the one with 83 components we can see the shape of the figure, and it shows more detail than the previous one with 43 components.
Image compression with PCA can have many applications, for example when building a website where we want to show images. If I wanted to show the different things one can visit in Rome with all the details, I might use the version with 523 components; in this case the image was compressed by 67%. On the other hand, a website might not need to show the full detail, only the image at a scaled-down size as shown before (where we will hardly be able to see the pixels); in that case I would use the version with 83 components, which compressed the picture by almost 70%.
There may not be a big difference between the picture with 523 components and the one with 83 components in terms of compression, but if we want to show several photos on the same page, the accumulated size of all of them could make our website slow.