1) Project description:

The goal of this paper is to apply dimensionality reduction techniques to a selected dataset. To this end, I chose several indicators from the Eurostat database in order to find out whether the number of features can be reduced without losing much information. The dataset describes trade in the European Union (28 countries) during 2018.

Last but not least, the extensions show how dimensionality reduction can be applied to image compression.

2) Dataset description

Description of the different features:

  1. GDP Gross domestic product at market prices (2018, million euro)
  2. I_TR_EX Intra-EU28 trade, exports (2018, millions of ECU/EURO)
  3. I_TR_IM Intra-EU28 trade, imports (2018, millions of ECU/EURO)
  4. E_TR_EX Extra-EU28 trade, exports (2018, millions of ECU/EURO)
  5. E_TR_IM Extra-EU28 trade, imports (2018, millions of ECU/EURO)
  6. T_TR_I Total intra-EU28 trade (2018, millions of ECU/EURO)
  7. T_TR_E Total extra-EU28 trade (2018, millions of ECU/EURO)
  8. T_TR_IM Total imports-EU28 trade (2018, millions of ECU/EURO)
  9. T_TR_EX Total exports-EU28 trade (2018, millions of ECU/EURO)
  10. T_GDP_R Total trade to GDP ratio (%)
  11. A_T_G Air transport of goods (2018, tonnes)

The derived features are computed as follows:

  • T_TR_I = I_TR_EX + I_TR_IM
  • T_TR_E = E_TR_EX + E_TR_IM
  • T_TR_IM = I_TR_IM + E_TR_IM
  • T_TR_EX = I_TR_EX + E_TR_EX
  • T_GDP_R = ((T_TR_IM + T_TR_EX) / GDP) * 100
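
As a minimal sketch of how the derived features follow from the base ones, the R code below applies the definitions to three illustrative rows (the values are the rounded figures printed in the data preview of section 3). The factor 100 in T_GDP_R is an assumption, made because the ratio is described as a percentage.

```r
# Illustrative rows only: rounded values as printed in the data preview.
trade <- data.frame(
  C_EU    = c("Austria", "Belgium", "Bulgaria"),
  GDP     = c(385712, 459820, 56087),
  I_TR_EX = c(111673, 287689, 19275),
  I_TR_IM = c(127260, 245472, 20403),
  E_TR_EX = c(44756, 107201, 8821),
  E_TR_IM = c(36748, 136329, 11702)
)

# Totals by trade area (intra-EU, extra-EU) and by direction (imports, exports).
trade$T_TR_I  <- trade$I_TR_EX + trade$I_TR_IM
trade$T_TR_E  <- trade$E_TR_EX + trade$E_TR_IM
trade$T_TR_IM <- trade$I_TR_IM + trade$E_TR_IM
trade$T_TR_EX <- trade$I_TR_EX + trade$E_TR_EX

# Assumption: the ratio is expressed in percent, hence the factor 100.
trade$T_GDP_R <- (trade$T_TR_IM + trade$T_TR_EX) / trade$GDP * 100

print(trade[, c("C_EU", "T_TR_I", "T_TR_E", "T_GDP_R")])
```

Note that the two totals T_TR_I + T_TR_E and T_TR_IM + T_TR_EX are always equal, since both add up the same four base features.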

Data bibliography:

GDP Eurostat database (tipsau10) - https://ec.europa.eu/eurostat/web/products-datasets/-/tipsau10

I_TR_EX Eurostat database (tet00047) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00047

I_TR_IM Eurostat database (tet00047) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00047

E_TR_EX Eurostat database (tet00055) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00055

E_TR_IM Eurostat database (tet00055) - https://ec.europa.eu/eurostat/web/products-datasets/-/tet00055

A_T_G Eurostat database (ttr00011) - https://ec.europa.eu/eurostat/web/products-datasets/-/ttr00011

3) Manipulation of the data:

  • First we load all the necessary packages and the data:
# A tibble: 28 x 8
   C_EU          GDP I_TR_EX I_TR_IM E_TR_EX E_TR_IM   A_T_G  M_T_G
   <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Austria   385712. 111673. 127260.  44756.  36748.  237701      0
 2 Belgium   459820. 287689. 245472. 107201  136329. 1416428 270317
 3 Bulgaria   56087.  19275.  20403.   8821.  11702.   29867  27868
 4 Croatia    51579.  10001.  18557.   4749.   5330.   11934  21573
 5 Cyprus     21138.   1250.   5277.   3001.   3890.   32186   6948
 6 Czechia   207772. 144491. 119732.  26769.  36726.   90526      0
 7 Denmark   301341.  56594.  60784.  36013   26001   242068  95835
 8 Estonia    26036.   9814.  12435.   4611.   3794.   11475  35947
 9 Finland   234453   37884.  46716.  26352.  19861   196810 116764
10 France   2353090  290669. 392438. 201915. 175901. 2407878 308629
# … with 18 more rows
     C_EU                GDP             I_TR_EX          I_TR_IM      
 Length:28          Min.   :  12324   Min.   :  1250   Min.   :  3867  
 Class :character   1st Qu.:  54960   1st Qu.: 17411   1st Qu.: 21085  
 Mode  :character   Median : 205834   Median : 62119   Median : 61327  
                    Mean   : 567906   Mean   :125855   Mean   :123237  
                    3rd Qu.: 477487   3rd Qu.:181395   3rd Qu.:165952  
                    Max.   :3344370   Max.   :778747   Max.   :722546  
    E_TR_EX          E_TR_IM           A_T_G             M_T_G       
 Min.   :  1081   Min.   :  1491   Min.   :  11475   Min.   :     0  
 1st Qu.:  8871   1st Qu.: 11184   1st Qu.:  28557   1st Qu.: 17917  
 Median : 22850   Median : 25792   Median : 157448   Median : 58458  
 Mean   : 69923   Mean   : 70741   Mean   : 638555   Mean   :145942  
 3rd Qu.: 76874   3rd Qu.: 85964   3rd Qu.: 828640   3rd Qu.:210472  
 Max.   :541985   Max.   :364885   Max.   :4842716   Max.   :604542  

As we can see above, our data frame “trade” has 28 rows and 8 columns. The summary also shows the main statistics of each feature.

A new data frame, x.trade.df, was created to hold only the selected features. It has 28 rows and 7 columns:
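
The selection step can be sketched as follows. This is only an illustration under assumptions: it rebuilds four rows of “trade” from the rounded values printed above and drops the country name by column name, which may differ from the code actually used.

```r
# Four illustrative rows of the "trade" data frame (rounded printed values).
trade <- data.frame(
  C_EU    = c("Austria", "Belgium", "Bulgaria", "Croatia"),
  GDP     = c(385712, 459820, 56087, 51579),
  I_TR_EX = c(111673, 287689, 19275, 10001),
  I_TR_IM = c(127260, 245472, 20403, 18557),
  E_TR_EX = c(44756, 107201, 8821, 4749),
  E_TR_IM = c(36748, 136329, 11702, 5330),
  A_T_G   = c(237701, 1416428, 29867, 11934),
  M_T_G   = c(0, 270317, 27868, 21573),
  stringsAsFactors = FALSE
)

# Keep the seven numeric features and drop the country label.
x.trade.df <- trade[, names(trade) != "C_EU"]

print(dim(x.trade.df))      # rows, columns
print(summary(x.trade.df))  # main statistics per feature
```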

The dataset and the selected features are the same as in the clustering project, except that M_T_G, the sea transport of goods, is included here.

I included this feature because it is not highly correlated with the rest. The other features depend mainly on how big the country is, whereas the sea transport of goods also depends on the country's access to the sea. For example, some countries have zero values. This is not the case for A_T_G (air transport of goods), because all the countries are well developed and have access to air transport.

At this point we are ready to begin with dimensionality reduction. The data frame may still need to be standardized or normalized, but this is done within the corresponding method.

4) Correlation between variables:

Although the corrplot package was loaded at the beginning, it sometimes has to be detached and loaded again, so I run the code above once more before drawing the correlation plot.

         GDP I_TR_EX I_TR_IM E_TR_EX E_TR_IM A_T_G M_T_G
GDP     1.00    0.82    0.93    0.94    0.89  0.92  0.69
I_TR_EX 0.82    1.00    0.95    0.93    0.92  0.89  0.65
I_TR_IM 0.93    0.95    1.00    0.97    0.91  0.95  0.63
E_TR_EX 0.94    0.93    0.97    1.00    0.91  0.95  0.63
E_TR_IM 0.89    0.92    0.91    0.91    1.00  0.92  0.83
A_T_G   0.92    0.89    0.95    0.95    0.92  1.00  0.63
M_T_G   0.69    0.65    0.63    0.63    0.83  0.63  1.00

As we can see in the graph and in the table above, all the variables are highly correlated, but M_T_G (sea transport of goods) has lower correlations with the other variables (between 0.63 and 0.83) than the rest of the data. The earlier decision to include this variable therefore makes sense, since it avoids having only highly correlated features.

Because all the variables are highly correlated, it looks a priori as if some features can be removed without losing much information.
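
A correlation table of this kind can be computed with base R. The sketch below uses only the ten countries printed in the data preview (rounded values), so its numbers will differ from the table above, which is based on all 28 countries. The object names x, corr and mean_corr are hypothetical.

```r
# First ten countries as printed in the preview (Austria ... France).
x <- data.frame(
  GDP     = c(385712, 459820, 56087, 51579, 21138, 207772, 301341, 26036, 234453, 2353090),
  I_TR_EX = c(111673, 287689, 19275, 10001, 1250, 144491, 56594, 9814, 37884, 290669),
  I_TR_IM = c(127260, 245472, 20403, 18557, 5277, 119732, 60784, 12435, 46716, 392438),
  E_TR_EX = c(44756, 107201, 8821, 4749, 3001, 26769, 36013, 4611, 26352, 201915),
  E_TR_IM = c(36748, 136329, 11702, 5330, 3890, 36726, 26001, 3794, 19861, 175901),
  A_T_G   = c(237701, 1416428, 29867, 11934, 32186, 90526, 242068, 11475, 196810, 2407878),
  M_T_G   = c(0, 270317, 27868, 21573, 6948, 0, 95835, 35947, 116764, 308629)
)

# Pearson correlation matrix, rounded as in the table above.
corr <- cor(x)
print(round(corr, 2))

# Average correlation of each feature with the other six:
# the feature with the lowest value is the least redundant one.
mean_corr <- (rowSums(corr) - 1) / (ncol(corr) - 1)
print(round(sort(mean_corr), 2))
```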

Additionally, other kinds of graphs can be used to check the correlation, such as the one below:

5) MDS - Multidimensional Scaling:

I will apply the dimensionality reduction method MDS.

First of all, the variables should be standardized:

            GDP     I_TR_EX     I_TR_IM    E_TR_EX    E_TR_IM      A_T_G
[1,] -0.2117651 -0.08321239  0.02573900 -0.2214649 -0.3418405 -0.3587276
[2,] -0.1256292  0.94957815  0.78205075  0.3280349  0.6595468  0.6961247
[3,] -0.5948892 -0.62536579 -0.65792005 -0.5376829 -0.5937038 -0.5447199
[4,] -0.6001286 -0.67978063 -0.66973058 -0.5735100 -0.6577816 -0.5607683
[5,] -0.6355106 -0.73112789 -0.75469669 -0.5888910 -0.6722612 -0.5426446
[6,] -0.4185847  0.10935070 -0.02242635 -0.3797415 -0.3420598 -0.4904357
          M_T_G
[1,] -0.7978678
[2,]  0.6799595
[3,] -0.6455130
[4,] -0.6799278
[5,] -0.7598829
[6,] -0.7978678
      GDP             I_TR_EX           I_TR_IM           E_TR_EX        
 Min.   :-0.6458   Min.   :-0.7311   Min.   :-0.7637   Min.   :-0.60579  
 1st Qu.:-0.5962   1st Qu.:-0.6363   1st Qu.:-0.6536   1st Qu.:-0.53724  
 Median :-0.4208   Median :-0.3740   Median :-0.3961   Median :-0.41422  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
 3rd Qu.:-0.1051   3rd Qu.: 0.3259   3rd Qu.: 0.2733   3rd Qu.: 0.06116  
 Max.   : 3.2271   Max.   : 3.8309   Max.   : 3.8343   Max.   : 4.15400  
    E_TR_IM            A_T_G             M_T_G        
 Min.   :-0.6964   Min.   :-0.5612   Min.   :-0.7979  
 1st Qu.:-0.5989   1st Qu.:-0.5459   1st Qu.:-0.6999  
 Median :-0.4520   Median :-0.4305   Median :-0.4783  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.1531   3rd Qu.: 0.1701   3rd Qu.: 0.3528  
 Max.   : 2.9579   Max.   : 3.7623   Max.   : 2.5072  

As we can see above, the data were standardized to mean 0 for all the variables. The maximum value is 4.154 (E_TR_EX) and the minimum is -0.7979 (M_T_G).
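
Standardization replaces each value by (value - column mean) / column standard deviation. A minimal sketch with two features and the ten countries printed in the preview (so the numbers differ from the 28-country output above):

```r
# Two features for the first ten countries (rounded printed values).
x <- data.frame(
  GDP   = c(385712, 459820, 56087, 51579, 21138, 207772, 301341, 26036, 234453, 2353090),
  M_T_G = c(0, 270317, 27868, 21573, 6948, 0, 95835, 35947, 116764, 308629)
)

# scale() centres each column on its mean and divides by its sample
# standard deviation.
z <- scale(x)

# The same result by hand for one column.
z_manual <- (x$GDP - mean(x$GDP)) / sd(x$GDP)

print(round(colMeans(z), 10))  # column means after standardization
print(apply(z, 2, sd))         # column standard deviations after standardization
print(summary(as.data.frame(z)))
```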

We compute the distance matrix of our standardized variables:

          1        2          3          4         5         6         7
1 0.0000000 2.499499 1.06270323 1.12644689 1.2197880 0.3529495 0.7662019
2 2.4994992 0.000000 3.22348353 3.31131988 3.4159194 2.5563745 2.6280262
           8         9       10        11        12        13        14
1 1.17723988 0.9735547 4.438445  9.554553 1.3917886 0.5348002 0.6330087
2 3.30898281 2.7582468 2.722305  7.628977 2.8304836 2.8438122 2.6342803
        15        16        17        18         19       20        21
1 3.990972 1.2078687 1.0989429 1.2580809 1.24185007 5.079678 0.7499864
2 2.209081 3.2504548 3.1837662 3.1665807 3.44822826 2.726870 1.9314882
         22        23        24         25       26       27       28
1 0.8716965 0.7323349 0.7218674 1.02444657 3.295780 1.026152 5.184801
2 2.8063779 2.8892734 3.0213397 3.20528380 1.816859 2.207987 3.336532
 [ reached getOption("max.print") -- omitted 4 rows ]

Analyzing the table above, country 11 stands out as the most distinct one (its distance to country 1 is 9.55, the largest in that row). We will check which country is number 11:

# A tibble: 1 x 1
  C_EU   
  <chr>  
1 Germany

Germany is country number 11.
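
One way to locate the most distinct country without reading the distance matrix by eye is to average each country's distance to all the others. The sketch below does this for four features and the ten countries printed in the preview. Germany is not among them, so the result only illustrates the procedure; the names countries, dm and mean_dist are hypothetical.

```r
countries <- c("Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus",
               "Czechia", "Denmark", "Estonia", "Finland", "France")
x <- data.frame(
  GDP     = c(385712, 459820, 56087, 51579, 21138, 207772, 301341, 26036, 234453, 2353090),
  I_TR_EX = c(111673, 287689, 19275, 10001, 1250, 144491, 56594, 9814, 37884, 290669),
  A_T_G   = c(237701, 1416428, 29867, 11934, 32186, 90526, 242068, 11475, 196810, 2407878),
  M_T_G   = c(0, 270317, 27868, 21573, 6948, 0, 95835, 35947, 116764, 308629)
)

# Euclidean distances between the standardized rows, as a full matrix.
dm <- as.matrix(dist(scale(x)))
dimnames(dm) <- list(countries, countries)

# Mean distance of each country to the other nine.
mean_dist <- rowSums(dm) / (nrow(dm) - 1)
print(round(sort(mean_dist, decreasing = TRUE), 2))

# Country that is, on average, farthest from the rest.
print(names(which.max(mean_dist)))
```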

We are ready to perform the multidimensional scaling of our distance matrix with the function cmdscale:

We will plot the previous results using the ggplot2 package. This gives a multidimensional scaling solution with two dimensions:

This graph confirms that Germany (11) is the most distinct country.

Subsequently we will check the eigenvalues and compute the goodness of fit of the solution:

 [1] 165.7582  14.6403   5.2051   2.0171   0.7959   0.5107   0.0727   0.0000
 [9]   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
[17]   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
[25]   0.0000   0.0000   0.0000   0.0000
[1] 0.9544893 0.9544893

We obtain 0.9545 as the goodness of fit. This is a high value, hence we can conclude that two dimensions fit the data really well.
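
The goodness of fit can be reproduced by hand from the printed eigenvalues. The sketch below assumes the usual definition used by cmdscale: the sum of the first k eigenvalues divided by the sum of all of them, once with absolute values and once with negative values set to zero. Both versions coincide here because no eigenvalue is negative. The eigenvalues are copied from the output above, so the result is only accurate to the printed precision.

```r
# Eigenvalues as printed above (the remaining 21 are zero).
eig <- c(165.7582, 14.6403, 5.2051, 2.0171, 0.7959, 0.5107, 0.0727, rep(0, 21))

k <- 2  # number of dimensions kept

# Two variants of the goodness of fit.
gof_abs <- sum(eig[1:k]) / sum(abs(eig))
gof_pos <- sum(eig[1:k]) / sum(pmax(eig, 0))

print(round(c(gof_abs, gof_pos), 4))
```

The seven non-zero eigenvalues add up to 189, that is 27 x 7, which is what one expects for 28 standardized observations of 7 features.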

We will continue with the fitted distances:

In the next plot we check the fitted distances against the observed distances, with the regression line drawn in red:

As we can observe in the graph above, the points lie close to the red regression line, which shows a strong linear relationship between the fitted and the observed distances. This was our initial target. Moreover, the regression line is at almost 45 degrees, that is, its slope is close to 1, which indicates a really good fit.

[1] "Coefficient of determination: 0.995639837782487"

The coefficient of determination also confirms that the data were fitted very well, with a high value of 0.9956.
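
A sketch of this check in base R is given below. It assumes that the coefficient of determination is the squared Pearson correlation between observed and fitted distances, which may not be exactly how the value above was obtained. The data are five features for the ten countries printed in the preview.

```r
x <- data.frame(
  GDP     = c(385712, 459820, 56087, 51579, 21138, 207772, 301341, 26036, 234453, 2353090),
  I_TR_EX = c(111673, 287689, 19275, 10001, 1250, 144491, 56594, 9814, 37884, 290669),
  E_TR_IM = c(36748, 136329, 11702, 5330, 3890, 36726, 26001, 3794, 19861, 175901),
  A_T_G   = c(237701, 1416428, 29867, 11934, 32186, 90526, 242068, 11475, 196810, 2407878),
  M_T_G   = c(0, 270317, 27868, 21573, 6948, 0, 95835, 35947, 116764, 308629)
)

# Observed distances in the full standardized space.
d_obs <- dist(scale(x))

# Classical MDS in two dimensions and the distances it reproduces.
fit   <- cmdscale(d_obs, k = 2, eig = TRUE)
d_fit <- dist(fit$points)

# Squared correlation and regression slope of fitted on observed distances.
r2    <- cor(as.vector(d_obs), as.vector(d_fit))^2
slope <- coef(lm(as.vector(d_fit) ~ as.vector(d_obs)))[[2]]

cat("Coefficient of determination:", r2, "\n")
cat("Slope of the regression line:", slope, "\n")
```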

6) Non-metric MDS:

Contrary to the previous technique (classical MDS), non-metric MDS only assumes a monotonic, non-parametric relationship between the dissimilarities in the distance matrix and the Euclidean distances between the items in the low-dimensional space. It then finds the location of each item in that space.

initial  value 41.066481 
iter   5 value 38.587966
iter  10 value 34.783267
iter  15 value 29.682143
iter  20 value 26.171888
iter  25 value 23.048600
iter  30 value 20.836697
iter  35 value 19.520166
iter  40 value 18.740493
iter  45 value 18.091546
iter  50 value 17.573080
iter  55 value 17.134058
iter  60 value 16.718544
iter  65 value 16.223233
iter  70 value 15.590125
iter  75 value 14.475025
iter  80 value 12.664584
iter  85 value 9.133363
iter  90 value 7.042891
iter  95 value 5.935541
iter 100 value 5.064410
final  value 5.064410 
stopped after 100 iterations

Above we obtain the plot of the two-dimensional solution.

Run 0 stress 0.01738228 
Run 1 stress 0.02665436 
Run 2 stress 0.02568885 
Run 3 stress 0.01848068 
Run 4 stress 0.02245584 
Run 5 stress 0.02078068 
Run 6 stress 0.01832534 
Run 7 stress 0.0215735 
Run 8 stress 0.02260607 
Run 9 stress 0.01827988 
Run 10 stress 0.01757911 
... Procrustes: rmse 0.008121119  max resid 0.03040519 
Run 11 stress 0.02458097 
Run 12 stress 0.01837971 
Run 13 stress 0.02457995 
Run 14 stress 0.03100273 
Run 15 stress 0.02121632 
Run 16 stress 0.01738548 
... Procrustes: rmse 0.005244945  max resid 0.01942814 
Run 17 stress 0.02076053 
Run 18 stress 0.02072885 
Run 19 stress 0.01756905 
... Procrustes: rmse 0.008072044  max resid 0.03022238 
Run 20 stress 0.01827637 
*** No convergence -- monoMDS stopping criteria:
    15: no. of iterations >= maxit
     5: stress ratio > sratmax

Above we have the stress values of the different runs. The stress measures the goodness of fit (the lower, the better), so it should be minimized in order to obtain the MDS solution.
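
To make the notion of stress concrete, the sketch below computes Kruskal's stress (type 1) for one fixed two-dimensional configuration using only base R. It is an illustration under assumptions: the configuration is the classical MDS solution rather than the optimized non-metric one, the monotone regression is done with isoreg, and the data are five features for the ten countries printed in the preview, so the value is not comparable with the log above.

```r
x <- data.frame(
  GDP     = c(385712, 459820, 56087, 51579, 21138, 207772, 301341, 26036, 234453, 2353090),
  I_TR_EX = c(111673, 287689, 19275, 10001, 1250, 144491, 56594, 9814, 37884, 290669),
  E_TR_IM = c(36748, 136329, 11702, 5330, 3890, 36726, 26001, 3794, 19861, 175901),
  A_T_G   = c(237701, 1416428, 29867, 11934, 32186, 90526, 242068, 11475, 196810, 2407878),
  M_T_G   = c(0, 270317, 27868, 21573, 6948, 0, 95835, 35947, 116764, 308629)
)

# Observed dissimilarities and distances in a 2-D configuration.
d     <- dist(scale(x))
d_obs <- as.vector(d)
d_fit <- as.vector(dist(cmdscale(d, k = 2)))

# Sort the pairs by observed dissimilarity, then fit the best monotone
# (non-decreasing) function of the fitted distances.
o     <- order(d_obs)
d_hat <- isoreg(d_obs[o], d_fit[o])$yf

# Kruskal's stress-1: relative departure from a monotone relationship.
stress1 <- sqrt(sum((d_fit[o] - d_hat)^2) / sum(d_fit[o]^2))
cat("Stress:", stress1, "\n")
```

A stress of zero would mean that the order of the fitted distances matches the order of the observed dissimilarities exactly.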

Now we will graphically check the stress:

We obtain a coefficient of determination of 0.9967. The graph above also shows that in almost all cases the fitted distances are very closely related to the observed distances.

Now we will compute the stress for different numbers of dimensions, in order to see how many dimensions are required to obtain a good fit:

Run 0 stress 0.05281717 
Run 1 stress 0.2217868 
Run 2 stress 0.1732915 
Run 3 stress 0.05282769 
... Procrustes: rmse 0.00106163  max resid 0.004544145 
... Similar to previous best
Run 4 stress 0.05282506 
... Procrustes: rmse 0.001037201  max resid 0.004515597 
... Similar to previous best
Run 5 stress 0.173177 
Run 6 stress 0.05307755 
... Procrustes: rmse 0.002012024  max resid 0.008901526 
... Similar to previous best
Run 7 stress 0.5560847 
Run 8 stress 0.5321585 
Run 9 stress 0.1863158 
Run 10 stress 0.2336482 
Run 11 stress 0.2234388 
Run 12 stress 0.1504285 
Run 13 stress 0.05282555 
... Procrustes: rmse 0.0004841585  max resid 0.001815399 
... Similar to previous best
Run 14 stress 0.2361418 
Run 15 stress 0.5562634 
Run 16 stress 0.2289602 
Run 17 stress 0.1731504 
Run 18 stress 0.552298 
Run 19 stress 0.185972 
Run 20 stress 0.1736204 
*** Solution reached
Run 0 stress 0.01738228 
Run 1 stress 0.03100088 
Run 2 stress 0.02111841 
Run 3 stress 0.02150661 
Run 4 stress 0.01849076 
Run 5 stress 0.02246308 
Run 6 stress 0.02420226 
Run 7 stress 0.01781484 
... Procrustes: rmse 0.01772445  max resid 0.06540565 
Run 8 stress 0.01766881 
... Procrustes: rmse 0.01021305  max resid 0.03832584 
Run 9 stress 0.3894621 
Run 10 stress 0.03133328 
Run 11 stress 0.02113734 
Run 12 stress 0.01827264 
Run 13 stress 0.01826271 
Run 14 stress 0.01996776 
Run 15 stress 0.01763289 
... Procrustes: rmse 0.01498576  max resid 0.05528703 
Run 16 stress 0.02457944 
Run 17 stress 0.03133148 
Run 18 stress 0.01891729 
Run 19 stress 0.01741122 
... Procrustes: rmse 0.002271449  max resid 0.00845779 
... Similar to previous best
Run 20 stress 0.01740315 
... Procrustes: rmse 0.001710116  max resid 0.006362149 
... Similar to previous best
*** Solution reached
Run 0 stress 0.004922049 
Run 1 stress 0.005034798 
... Procrustes: rmse 0.01212265  max resid 0.03559764 
Run 2 stress 0.005159033 
... Procrustes: rmse 0.01226255  max resid 0.0440695 
Run 3 stress 0.006996777 
Run 4 stress 0.01075896 
Run 5 stress 0.007540382 
Run 6 stress 0.006991977 
Run 7 stress 0.007230715 
Run 8 stress 0.00777392 
Run 9 stress 0.008956499 
Run 10 stress 0.009431452 
Run 11 stress 0.006409408 
Run 12 stress 0.006398254 
Run 13 stress 0.01092658 
Run 14 stress 0.005068423 
... Procrustes: rmse 0.009942255  max resid 0.03492989 
Run 15 stress 0.01536551 
Run 16 stress 0.008970558 
Run 17 stress 0.008161329 
Run 18 stress 0.01189896 
Run 19 stress 0.006747112 
Run 20 stress 0.007377814 
*** No convergence -- monoMDS stopping criteria:
    20: no. of iterations >= maxit
Run 0 stress 0.0005684983 
Run 1 stress 0.005124519 
Run 2 stress 0.00313353 
Run 3 stress 0.003063887 
Run 4 stress 0.001130098 
Run 5 stress 0.002415334 
Run 6 stress 0.003542343 
Run 7 stress 0.008225808 
Run 8 stress 0.003611772 
Run 9 stress 0.004926459 
Run 10 stress 0.002758216 
Run 11 stress 0.0009998899 
... Procrustes: rmse 0.01340251  max resid 0.03291724 
Run 12 stress 0.00450292 
Run 13 stress 0.004131073 
Run 14 stress 0.004653221 
Run 15 stress 0.004779995 
Run 16 stress 0.003550253 
Run 17 stress 0.005096152 
Run 18 stress 0.004022344 
Run 19 stress 0.00287027 
Run 20 stress 0.001638391 
*** No convergence -- monoMDS stopping criteria:
    20: no. of iterations >= maxit
Run 0 stress 9.951893e-05 
Run 1 stress 0.0008508154 
Run 2 stress 0.002194416 
Run 3 stress 0.004169654 
Run 4 stress 0.002355673 
Run 5 stress 0.002029783 
Run 6 stress 0.002243723 
Run 7 stress 0.002041155 
Run 8 stress 0.00221081 
Run 9 stress 0.001388282 
Run 10 stress 0.001669469 
Run 11 stress 0.002571836 
Run 12 stress 0.0012516 
Run 13 stress 0.0007910094 
Run 14 stress 0.001337923 
Run 15 stress 0.001331504 
Run 16 stress 0.003299361 
Run 17 stress 0.00172367 
Run 18 stress 0.002965277 
Run 19 stress 0.003053235 
Run 20 stress 0.002452937 
*** No convergence -- monoMDS stopping criteria:
    20: no. of iterations >= maxit
Run 0 stress 6.497191e-05 
Run 1 stress 0.000996892 
Run 2 stress 0.002420061 
Run 3 stress 0.003097316 
Run 4 stress 0.002399856 
Run 5 stress 0.002591253 
Run 6 stress 0.001428916 
Run 7 stress 0.001944264 
Run 8 stress 0.0006884085 
Run 9 stress 0.001526541 
Run 10 stress 0.00216325 
Run 11 stress 0.00191342 
Run 12 stress 0.001818719 
Run 13 stress 0.002417233 
Run 14 stress 0.001162766 
Run 15 stress 0.002723075 
Run 16 stress 0.002553233 
Run 17 stress 0.001472004 
Run 18 stress 0.002449874 
Run 19 stress 0.001763495 
Run 20 stress 0.001525711 
*** No convergence -- monoMDS stopping criteria:
    20: no. of iterations >= maxit
Run 0 stress 0 
Run 1 stress 0.001370632 
Run 2 stress 0.0009133847 
Run 3 stress 0.0007920234 
Run 4 stress 0.0009817165 
Run 5 stress 0.001604746 
Run 6 stress 0.001329391 
Run 7 stress 0.001881362 
Run 8 stress 0.001098428 
Run 9 stress 0.00199932 
Run 10 stress 0.002371378 
Run 11 stress 0.001692948 
Run 12 stress 0.002828057 
Run 13 stress 0.00116458 
Run 14 stress 0.003370393 
Run 15 stress 0.003534193 
Run 16 stress 0.002540077 
Run 17 stress 0.00100504 
Run 18 stress 0.001356418 
Run 19 stress 0.000723162 
Run 20 stress 0.002597001 
*** No convergence -- monoMDS stopping criteria:
    20: no. of iterations >= maxit
Run 0 stress 0 
Run 1 stress 0.002580288 
Run 2 stress 0.0009090862 
Run 3 stress 0.00157642 
Run 4 stress 0.001814163 
Run 5 stress 0.001873632 
Run 6 stress 0.001450652 
Run 7 stress 0.002143105 
Run 8 stress 0.002803625 
Run 9 stress 0.0008657835 
Run 10 stress 0.003171441 
Run 11 stress 0.0007095645 
Run 12 stress 0.002457728 
Run 13 stress 0.003629506 
Run 14 stress 0.002011247 
Run 15 stress 0.00139924 
Run 16 stress 0.000921801 
Run 17 stress 0.002123891 
Run 18 stress 0.002381936 
Run 19 stress 0.003486536 
Run 20 stress 0.003189123 
*** No convergence -- monoMDS stopping criteria:
    20: no. of iterations >= maxit
Run 0 stress 0 
Run 1 stress 0.002310324 
Run 2 stress 0.001242967 
Run 3 stress 0.0008261568 
Run 4 stress 0.001427361 
Run 5 stress 0.002581567 
Run 6 stress 0.001848706 
Run 7 stress 0.001797331 
Run 8 stress 0.0007404808 
Run 9 stress 0.002163955 
Run 10 stress 0.001756932 
Run 11 stress 0.001823287 
Run 12 stress 0.001849845 
Run 13 stress 0.001394392 
Run 14 stress 0.00248491 
Run 15 stress 0.002130265 
Run 16 stress 0.001155515 
Run 17 stress 0.001404284 
Run 18 stress 0.002026062 
Run 19 stress 0.001522932 
Run 20 stress 0.001226192 
*** No convergence -- monoMDS stopping criteria:
    20: no. of iterations >= maxit
Run 0 stress 0 
Run 1 stress 0.002686181 
Run 2 stress 0.001089588 
Run 3 stress 0.002984528 
Run 4 stress 0.002780958 
Run 5 stress 0.003116096 
Run 6 stress 0.003322968 
Run 7 stress 0.004951522 
Run 8 stress 0.001665136 
Run 9 stress 0.002437215 
Run 10 stress 0.001540673 
Run 11 stress 0.001362519 
Run 12 stress 0.001574775 
Run 13 stress 0.002798844 
Run 14 stress 0.001203459 
Run 15 stress 0.001295739 
Run 16 stress 0.00204832 
Run 17 stress 0.001385791 
Run 18 stress 0.001410418 
Run 19 stress 0.001655983 
Run 20 stress 0.001658941 
*** No convergence -- monoMDS stopping criteria:
    20: no. of iterations >= maxit

Graph of the above stress analysis:

As we can see above, we take 2% as the maximum acceptable stress (a low threshold, because the data are highly correlated). Therefore, we should use two dimensions, which give a stress of a little less than 2% (about 1.7%).
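
The choice of the number of dimensions can be written down as a short calculation. The sketch below assumes that the ten blocks of the log above correspond to 1 to 10 dimensions, in that order, and takes the lowest stress printed in each block (the "Run 0" value in every case). The names stress, max_stress and k_min are hypothetical.

```r
# Best stress per number of dimensions, copied from the log above.
stress <- c(0.05281717, 0.01738228, 0.004922049, 0.0005684983,
            9.951893e-05, 6.497191e-05, 0, 0, 0, 0)
names(stress) <- 1:10

max_stress <- 0.02  # maximum acceptable stress (2%)

# Smallest number of dimensions whose stress does not exceed the threshold.
k_min <- min(which(stress <= max_stress))

print(round(100 * stress, 2))  # stress in percent
cat("Dimensions needed:", k_min, "\n")
```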

7) PCA - Principal Component Analysis:

Now we will apply the linear dimensionality reduction method PCA, which finds the directions that retain as much of the variance as possible.

Firstly, we will normalize the data, as this is the first step before applying PCA:

      GDP             I_TR_EX           I_TR_IM           E_TR_EX        
 Min.   :-0.6458   Min.   :-0.7311   Min.   :-0.7637   Min.   :-0.60579  
 1st Qu.:-0.5962   1st Qu.:-0.6363   1st Qu.:-0.6536   1st Qu.:-0.53724  
 Median :-0.4208   Median :-0.3740   Median :-0.3961   Median :-0.41422  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
 3rd Qu.:-0.1051   3rd Qu.: 0.3259   3rd Qu.: 0.2733   3rd Qu.: 0.06116  
 Max.   : 3.2271   Max.   : 3.8309   Max.   : 3.8343   Max.   : 4.15400  
    E_TR_IM            A_T_G             M_T_G        
 Min.   :-0.6964   Min.   :-0.5612   Min.   :-0.7979  
 1st Qu.:-0.5989   1st Qu.:-0.5459   1st Qu.:-0.6999  
 Median :-0.4520   Median :-0.4305   Median :-0.4783  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
 3rd Qu.: 0.1531   3rd Qu.: 0.1701   3rd Qu.: 0.3528  
 Max.   : 2.9579   Max.   : 3.7623   Max.   : 2.5072  

Once the data has been normalized, we can apply PCA:

Above we can find the eigenvalues for each dimension. As we can see, the first component alone contains almost 88% of the information, because there is high correlation between the features.
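
A minimal PCA sketch in base R follows. It uses prcomp, which is an assumption about tooling (the tables in this section look like output of a helper package), and only the ten countries printed in the data preview, so the percentages differ from the 28-country results.

```r
x <- data.frame(
  GDP     = c(385712, 459820, 56087, 51579, 21138, 207772, 301341, 26036, 234453, 2353090),
  I_TR_EX = c(111673, 287689, 19275, 10001, 1250, 144491, 56594, 9814, 37884, 290669),
  I_TR_IM = c(127260, 245472, 20403, 18557, 5277, 119732, 60784, 12435, 46716, 392438),
  E_TR_EX = c(44756, 107201, 8821, 4749, 3001, 26769, 36013, 4611, 26352, 201915),
  E_TR_IM = c(36748, 136329, 11702, 5330, 3890, 36726, 26001, 3794, 19861, 175901),
  A_T_G   = c(237701, 1416428, 29867, 11934, 32186, 90526, 242068, 11475, 196810, 2407878),
  M_T_G   = c(0, 270317, 27868, 21573, 6948, 0, 95835, 35947, 116764, 308629)
)

# PCA on standardized data (centred, scaled to unit variance).
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components.
eig     <- pca$sdev^2
var_pct <- 100 * eig / sum(eig)

print(round(data.frame(eigenvalue = eig,
                       variance.percent = var_pct,
                       cumulative = cumsum(var_pct)), 3))
```

With standardized data the eigenvalues add up to the number of features, here 7.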

Plot of the columns for the eigenvalue of each dimension:

Plot of the columns for the eigenvalue of each variable:

Plot of the columns for each of the dimensions, with the vectors:

Representation of the importance of each variable:

As we can see above, the least important variable is M_T_G (sea transport of goods), because it is less correlated with the others, as we saw in the previous steps.

We have finished with the graphical results of PCA. Now we can see the numerical results:

      eigenvalue variance.percent cumulative.variance.percent
Dim.1 6.13919303      87.70275750                    87.70276
Dim.2 0.54223180       7.74616851                    95.44893
Dim.3 0.19278247       2.75403532                    98.20296
Dim.4 0.07470699       1.06724273                    99.27020
Dim.5 0.02947691       0.42109872                    99.69130
Dim.6 0.01891505       0.27021507                    99.96152
Dim.7 0.00269375       0.03848215                   100.00000

Above we have the table with the eigenvalue of each dimension.
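
The percentages in the table follow directly from the eigenvalues, and the same arithmetic tells us how many components are needed for a given share of the variance. In the sketch below the eigenvalues are copied from the table, while the 95% target is a hypothetical threshold chosen only for illustration.

```r
# Eigenvalues as printed in the table above.
eig <- c(Dim.1 = 6.13919303, Dim.2 = 0.54223180, Dim.3 = 0.19278247,
         Dim.4 = 0.07470699, Dim.5 = 0.02947691, Dim.6 = 0.01891505,
         Dim.7 = 0.00269375)

variance_percent   <- 100 * eig / sum(eig)
cumulative_percent <- cumsum(variance_percent)

print(round(cbind(variance_percent, cumulative_percent), 3))

# Hypothetical target: keep at least 95% of the total variance.
target <- 95
n_comp <- min(which(cumulative_percent >= target))
cat("Components needed for", target, "% of the variance:", n_comp, "\n")
```

The cumulative share of the first two components, 95.45%, is the same figure as the goodness of fit of the two-dimensional classical MDS solution in section 5.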

We can also obtain the results for the variables used in this analysis, starting with their coordinates:

            Dim.1       Dim.2       Dim.3       Dim.4         Dim.5       Dim.6
GDP     0.9474782 -0.04856201  0.30064474 -0.08518174  0.0247928788 -0.03348593
I_TR_EX 0.9455461 -0.10920569 -0.29845573 -0.06257200  0.0085093814  0.01098362
I_TR_IM 0.9745411 -0.17794242 -0.02245772 -0.09145207  0.0908932163  0.02214732
E_TR_EX 0.9717463 -0.18031725  0.03124067 -0.04247943 -0.1412986982  0.01652801
E_TR_IM 0.9744126  0.15329370 -0.08532839  0.08934757 -0.0007241531 -0.10790854
A_T_G   0.9618374 -0.14905343  0.06672130  0.21139098  0.0230421281  0.05428960
M_T_G   0.7607814  0.64657091  0.01031479 -0.02643277 -0.0056081276  0.04814369
               Dim.7
GDP     -0.023402168
I_TR_EX -0.028854475
I_TR_IM  0.031405006
E_TR_EX  0.013256981
E_TR_IM  0.010605346
A_T_G   -0.005975913
M_T_G    0.001816897
           Dim.1     Dim.2       Dim.3      Dim.4        Dim.5      Dim.6
GDP     14.62269  0.434919 46.88562037  9.7125156  2.085316364  5.9281235
I_TR_EX 14.56311  2.199407 46.20535459  5.2408155  0.245648441  0.6377984
I_TR_IM 15.46995  5.839478  0.26161560 11.1950456 28.027281844  2.5931925
E_TR_EX 15.38135  5.996386  0.50625938  2.4154388 67.732071661  1.4442209
E_TR_IM 15.46587  4.333748  3.77676143 10.6857301  0.001779012 61.5607625
A_T_G   15.06926  4.097311  2.30919956 59.8152120  1.801205283 15.5820903
M_T_G    9.42776 77.098752  0.05518908  0.9352424  0.106697395 12.2538121
             Dim.7
GDP     20.3308197
I_TR_EX 30.9078667
I_TR_IM 36.6134329
E_TR_EX  6.5242702
E_TR_IM  4.1753452
A_T_G    1.3257180
M_T_G    0.1225472

Above we can find the contribution of each variable to each dimension (in percent).
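
The coordinates, the contributions and the quality of representation are linked by simple formulas: the quality of representation is the squared coordinate, and the contribution is that square divided by the eigenvalue of the dimension, in percent. The sketch below checks this for Dim.1 with the values printed in the tables of this section.

```r
# Coordinates of the variables on Dim.1 and its eigenvalue (from the tables).
coord_dim1 <- c(GDP = 0.9474782, I_TR_EX = 0.9455461, I_TR_IM = 0.9745411,
                E_TR_EX = 0.9717463, E_TR_IM = 0.9744126, A_T_G = 0.9618374,
                M_T_G = 0.7607814)
eig1 <- 6.13919303

# Quality of representation: squared coordinate.
cos2 <- coord_dim1^2

# Contribution in percent: share of each variable in the eigenvalue.
contrib <- 100 * cos2 / eig1

print(round(cbind(coord = coord_dim1, cos2 = cos2, contrib = contrib), 4))
```

The squared coordinates add up to the eigenvalue of the dimension, so the contributions add up to 100.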

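Moreover, the quality of representation (squared coordinates) of the variables, followed by the coordinates, contributions and quality of representation of the countries: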

            Dim.1       Dim.2        Dim.3        Dim.4        Dim.5
GDP     0.8977149 0.002358269 0.0903872582 0.0072559282 6.146868e-04
I_TR_EX 0.8940574 0.011925883 0.0890758251 0.0039152555 7.240957e-05
I_TR_IM 0.9497303 0.031663507 0.0005043490 0.0083634817 8.261577e-03
E_TR_EX 0.9442910 0.032514311 0.0009759793 0.0018045016 1.996532e-02
E_TR_IM 0.9494799 0.023498957 0.0072809341 0.0079829875 5.243977e-07
A_T_G   0.9251312 0.022216924 0.0044517320 0.0446861451 5.309397e-04
M_T_G   0.5787884 0.418053945 0.0001063949 0.0006986914 3.145110e-05
               Dim.6        Dim.7
GDP     0.0011213078 5.476615e-04
I_TR_EX 0.0001206399 8.325807e-04
I_TR_IM 0.0004905038 9.862744e-04
E_TR_EX 0.0002731752 1.757475e-04
E_TR_IM 0.0116442520 1.124734e-04
A_T_G   0.0029473609 3.571153e-05
M_T_G   0.0023178153 3.301116e-06
        Dim.1       Dim.2        Dim.3        Dim.4       Dim.5        Dim.6
1  -0.7081383 -0.62480743 -0.112336274 -0.201171787  0.14054581 -0.130246854
2   1.4889686  0.19158694 -0.754572000  0.197345371  0.24140882  0.267109310
3  -1.5789218 -0.15750459  0.030584106  0.079321199 -0.07221526 -0.050829273
4  -1.6623837 -0.17777122  0.071245252  0.072900896 -0.05319533 -0.028212932
5  -1.7581685 -0.22041251  0.088862350  0.143517731 -0.08806775 -0.048703436
6  -0.8459732 -0.56271422 -0.413615781 -0.242020377  0.20731004 -0.143108359
7  -0.9465338 -0.01204251  0.090380132 -0.025337131 -0.06761222  0.059823996
8  -1.6721466 -0.10001795  0.058367283  0.082734696 -0.07982926  0.011938624
9  -1.0917732  0.14774514  0.118859931  0.004779335 -0.07087744  0.117784488
10  3.5990562 -0.30028081  0.813951178 -0.140561544  0.48246822  0.095626890
           Dim.7
1   0.0442750157
2   0.1641505775
3  -0.0007283838
4   0.0031378994
5  -0.0155473461
6  -0.0239781655
7  -0.0130275743
8  -0.0072342891
9   0.0029381720
10 -0.0694704145
 [ reached getOption("max.print") -- omitted 18 rows ]
         Dim.1        Dim.2        Dim.3        Dim.4       Dim.5        Dim.6
1   0.29172042 2.571280e+00 2.337838e-01  1.934705528  2.39329278  3.203087864
2   1.28973874 2.417624e-01 1.054814e+01  1.861806795  7.06100089 13.471390160
3   1.45028019 1.633967e-01 1.732870e-02  0.300787141  0.63185584  0.487822872
4   1.60765644 2.081517e-01 9.403430e-02  0.254065943  0.34285222  0.150290247
5   1.79825665 3.199849e-01 1.462884e-01  0.984672573  0.93970999  0.447871639
6   0.41633593 2.085608e+00 3.169338e+00  2.800171236  5.20715697  3.866912727
7   0.52119824 9.551931e-04 1.513284e-01  0.030689909  0.55387321  0.675748677
8   1.62659482 6.588899e-02 6.311221e-02  0.327232233  0.77211898  0.026911808
9   0.69341867 1.437750e-01 2.617251e-01  0.001091983  0.60866175  2.619452716
10  7.53542829 5.938984e-01 1.227358e+01  0.944525667 28.20313818  1.726610699
          Dim.7
1  2.598975e+00
2  3.572477e+01
3  7.034047e-04
4  1.305457e-02
5  3.204772e-01
6  7.622836e-01
7  2.250152e-01
8  6.938672e-02
9  1.144561e-02
10 6.398592e+00
 [ reached getOption("max.print") -- omitted 18 rows ]
        Dim.1        Dim.2        Dim.3        Dim.4        Dim.5        Dim.6
1  0.50981502 0.3968888091 1.282970e-02 4.114439e-02 2.008225e-02 1.724690e-02
2  0.73444868 0.0121596818 1.886217e-01 1.290162e-02 1.930619e-02 2.363570e-02
3  0.98427287 0.0097944646 3.693056e-04 2.484121e-03 2.058981e-03 1.020051e-03
4  0.98375714 0.0112498888 1.806913e-03 1.891869e-03 1.007331e-03 2.833491e-04
5  0.97248941 0.0152839763 2.484274e-03 6.480000e-03 2.440044e-03 7.462478e-04
6  0.53972055 0.2387984648 1.290179e-01 4.417328e-02 3.241130e-02 1.544492e-02
7  0.98108239 0.0001588062 8.944976e-03 7.029883e-04 5.005916e-03 3.919079e-03
8  0.99049808 0.0035437314 1.206824e-03 2.424824e-03 2.257507e-03 5.049094e-05
9  0.95598152 0.0175069720 1.133067e-02 1.831975e-05 4.029031e-03 1.112656e-02
10 0.92705697 0.0064533367 4.741610e-02 1.414042e-03 1.665968e-02 6.544692e-04
          Dim.7
1  1.992939e-03
2  8.926377e-03
3  2.094666e-07
4  3.505124e-06
5  7.604601e-05
6  4.335984e-04
7  1.858490e-04
8  1.853944e-05
9  6.923713e-06
10 3.454053e-04
 [ reached getOption("max.print") -- omitted 18 rows ]

8) Conclusions:

As a result of the above analysis, we can conclude that it is possible to reduce the dimensions of the selected dataset without losing much information.

The first step of the analysis allows us to conclude that all the variables are highly correlated, probably because all of them depend on how big the country is. Bigger economies have a bigger GDP, and the same holds for the rest of the variables. The European Union (28 countries) has free intra-community trade, which means that there are no internal barriers. For extra-community trade, the European Union has a common trade policy towards external countries, via tariffs or free trade agreements (for example CETA, between the EU-28 and Canada). Consequently, big economies have a bigger GDP and a larger volume of intra-community and extra-community trade, but also of air transport of goods.

On the other hand, one variable is less correlated: the sea transport of goods. It depends not only on how big the economy is but also on geography, as there are countries with zero values because they have no access to sea ports.

After applying dimensionality reduction techniques, we find a good fit with two dimensions, which means that we reduced the data by five features. Moreover, we saw that the first component alone contains almost 88% of the information. However, one of the main reasons for applying dimensionality reduction, apart from data compression, is visualization, and we need at least two dimensions for a graphical representation. Additionally, since the variables are highly correlated, we set a low acceptable level of stress, less than 2%, and we achieved it with two dimensions.

9) Extensions - Image compression:

One of the applications of dimensionality reduction is image compression, where we want to reduce the size of the image in memory while preserving its quality and shape as far as possible.

First we install the jpeg package in order to manipulate .jpeg files:

Original photo

We can see the original photo above (1,243,080 bytes, 1.2 MB).

[1] 1836 3264    3

As we can see above, the Rome image is now represented as an array of three 1836 x 3264 matrices. Each matrix corresponds to one channel of the RGB (red, green and blue) color scheme.

Now we will extract the individual color value matrices to perform PCA on each one:

We performed PCA on each color value matrix:

Now we collect the PCA objects into the list rgb.pca:

At this point we are ready to compress the image. Once the principal components have been found for each color value matrix, we have new dimensions that describe the original pixels (data).

We are going to use a for loop to reconstruct the original image from the projections of the data. We will create different files, each with a different number of components:
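
The reconstruction step can be sketched without an image file. The code below is an illustration under assumptions: a small synthetic matrix stands in for one color channel, prcomp is run without centering, and the helper reconstruct() is a hypothetical name. For a real photo the same loop would be applied to each of the three color matrices and the result written to disk.

```r
set.seed(1)

# Synthetic stand-in for one colour channel: a smooth pattern plus a little
# noise, with values around 0.5 (a real channel would come from the photo).
n_row <- 60
n_col <- 80
channel <- outer(seq(0, 1, length.out = n_row), seq(0, 1, length.out = n_col),
                 function(u, v) 0.5 + 0.25 * sin(6 * u) * cos(4 * v)) +
  matrix(rnorm(n_row * n_col, sd = 0.02), n_row, n_col)

# PCA without centring, so scores times loadings rebuild the matrix directly.
pca <- prcomp(channel, center = FALSE)

# Rebuild the channel from the first k principal components only.
reconstruct <- function(pca, k) {
  pca$x[, 1:k, drop = FALSE] %*% t(pca$rotation[, 1:k, drop = FALSE])
}

for (k in c(1, 5, 20, 60)) {
  approx <- reconstruct(pca, k)
  rmse   <- sqrt(mean((channel - approx)^2))
  cat(sprintf("%2d components: RMSE = %.6f\n", k, rmse))
}
```

The error shrinks as components are added and vanishes when all of them are used, which is the trade-off between file size and quality seen in the compressed photos.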

  • 3 components (182,824 bytes, 188 KB):
3 components

  • 43 components (332,384 bytes, 336 KB):
43 components

  • 83 components (375,131 bytes, 377 KB):
83 components

  • 523 components (409,873 bytes, 414 KB):
523 components

10) Summary and conclusions - Image compression:

  • Original photo, 1836 components (1,243,080 bytes, 1.2 MB)
  • 3 components (182,824 bytes, 188 KB)
  • 43 components (332,384 bytes, 336 KB)
  • 83 components (375,131 bytes, 377 KB)
  • 523 components (409,873 bytes, 414 KB)

As we can see in the different photos, reducing the number of components also reduces the memory used. The original photo was taken with my smartphone during a trip to Rome and, as we can see, its quality is really good, with a lot of components. However, I cannot see much difference between the original and the version with 523 components. If we zoomed in on the 523-component photo we would surely notice the difference, but without zoom it looks more or less the same. The version with 3 components is not really useful, as we lost a lot of information: for example, we cannot even see the shape of the sculpture. In the next one, with 43 components, we can see the shape, but the quality is not really good and the image looks heavily pixelated. Finally, in the one with 83 components we can see the shape of the figure, with more detail than in the previous one with 43 components.

Image compression with PCA could have many applications, for example when building a website that shows images. If I wanted to show the sights of Rome with all the details, I would probably use the version with 523 components, which compresses the image by 67%. On the other hand, a website may not need the full detail and may only show the image at a reduced scale, as shown before, where the pixels are almost invisible. In this case I would use the version with 83 components, which compresses the picture by almost 70%.

There may not be much difference in compression between the picture with 523 components and the one with 83 components, but if we want to show several photos on the same page, the size differences add up and could make our website slow to load.
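
The compression percentages quoted above follow from the file sizes. A short worked calculation, using the byte counts listed in this section (the variable names are hypothetical):

```r
# File sizes in bytes, as listed above.
original <- 1243080
sizes <- c("3" = 182824, "43" = 332384, "83" = 375131, "523" = 409873)

# Size reduction relative to the original photo, in percent.
reduction_pct <- 100 * (1 - sizes / original)
print(round(reduction_pct, 1))

# Bytes saved per photo by choosing 83 instead of 523 components.
cat("Saving per photo:", sizes[["523"]] - sizes[["83"]], "bytes\n")
```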