This document describes some explorations of the global trade data (data set tstrade.csv) as provided by the University of Indiana, for the Client Project “Global Trade Visualization Tool” of Purdue University.

Load the data:

tstrade <- read.csv("~/Desktop/IVMOOC/FinalProject/tstrade.csv")

Obtain basic data information:

str(tstrade)
## 'data.frame':    15352380 obs. of  5 variables:
##  $ TRAD_COMM: Factor w/ 57 levels "atp","b_t","c_b",..: 43 54 18 51 37 3 44 29 9 27 ...
##  $ REG      : Factor w/ 134 levels "alb","are","arg",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ REG.1    : Factor w/ 134 levels "alb","are","arg",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ YEAR     : Factor w/ 15 levels "Y1995","Y1996",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Value    : num  1e-06 1e-06 1e-06 1e-06 1e-06 ...
unique(tstrade$YEAR)
##  [1] Y1995 Y1996 Y1997 Y1998 Y1999 Y2000 Y2001 Y2002 Y2003 Y2004 Y2005
## [12] Y2006 Y2007 Y2008 Y2009
## 15 Levels: Y1995 Y1996 Y1997 Y1998 Y1999 Y2000 Y2001 Y2002 Y2003 ... Y2009

There are about 15.4 million observations of 5 variables, covering 57 traded commodities and 134 regions over the years 1995 to 2009. This is a full matrix across the 4 dimensions: 57 * 134 * 134 * 15 = 15352380.
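The full-matrix claim can be checked by comparing the row count with the product of the factor level counts. The following is a minimal sketch, not part of the original analysis: `toy` is a small hypothetical stand-in for `tstrade` so that the code runs without the CSV file, and the names `dims`, `expected_rows` and `is_full_grid` are illustrative only.

```r
# Hypothetical stand-in for tstrade: every combination of a few factor levels
toy <- expand.grid(
  TRAD_COMM = c("atp", "b_t"),
  REG       = c("alb", "are", "arg"),
  REG.1     = c("alb", "are", "arg"),
  YEAR      = c("Y1995", "Y1996"),
  stringsAsFactors = TRUE
)
toy$Value <- 1e-06

# The four dimension columns of the data set
dims <- c("TRAD_COMM", "REG", "REG.1", "YEAR")

# Number of rows a complete grid would have: product of the level counts
expected_rows <- prod(sapply(toy[dims], nlevels))

# A full matrix has exactly that many rows and no repeated combination
is_full_grid <- nrow(toy) == expected_rows && !any(duplicated(toy[dims]))
is_full_grid
```

Applied to the real data, the same check would use `tstrade` in place of `toy`; the duplicate test on 15 million rows may take a while.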

Question: There are two REG variables. Are these the exporting and importing regions (in that order)?
Question: What is the definition of the Value variable? Is it the export of TRAD_COMM from REG, in USD or mUSD?

summary(tstrade)
##    TRAD_COMM             REG               REG.1         
##  atp    :  269340   alb    :  114570   alb    :  114570  
##  b_t    :  269340   are    :  114570   are    :  114570  
##  c_b    :  269340   arg    :  114570   arg    :  114570  
##  cmn    :  269340   arm    :  114570   arm    :  114570  
##  cmt    :  269340   aus    :  114570   aus    :  114570  
##  cns    :  269340   aut    :  114570   aut    :  114570  
##  (Other):13736340   (Other):14664960   (Other):14664960  
##       YEAR             Value         
##  Y1995  :1023492   Min.   :-3434.00  
##  Y1996  :1023492   1st Qu.:    0.00  
##  Y1997  :1023492   Median :    0.00  
##  Y1998  :1023492   Mean   :    7.32  
##  Y1999  :1023492   3rd Qu.:    0.00  
##  Y2000  :1023492   Max.   :95014.00  
##  (Other):9211428

There are 74 negative “Value” values. Their distribution over REG:

negVal = subset(tstrade, tstrade$Value < 0)
subset(table(negVal$REG),table(negVal$REG)>0)
## 
## are can hkg usa 
##   1   7  43  23

Their distribution over REG.1:

subset(table(negVal$REG.1),table(negVal$REG.1)>0)
## twn 
##  74

Question: Are these negative values related to indirect (re-)trades? Do they have to be included in the visualizations?
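The two one-way tabulations above can also be combined into a single cross-tabulation of region pairs. This is only a sketch under assumptions: `negToy` is a hypothetical stand-in for `negVal` with made-up values, and `pair_counts` is an illustrative name.

```r
# Hypothetical stand-in for negVal (values are invented for illustration)
negToy <- data.frame(
  REG   = factor(c("hkg", "hkg", "usa", "can"),
                 levels = c("alb", "can", "hkg", "usa")),
  REG.1 = factor(c("twn", "twn", "twn", "twn"),
                 levels = c("alb", "twn")),
  Value = c(-5, -2, -1, -3)
)

# Drop unused factor levels first, then count rows per (REG, REG.1) pair
pair_counts <- table(droplevels(negToy[c("REG", "REG.1")]))
pair_counts
```

With the real data, `table(droplevels(negVal[c("REG", "REG.1")]))` would give the same kind of table without the empty rows and columns.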

Distribution of the “Value” variable:

hist(log(tstrade$Value))
## Warning in log(tstrade$Value): NaNs produced

Two main groups can be identified: very small values (< 1E-5) and larger values (> 1E-2). Let’s assume that the very small values are placeholders for zero or not available (NA) data and remove them from the dataset. The negative values are also ignored from here on.
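The size of each group can be counted before anything is removed. The sketch below is illustrative only: `toy_values` is a made-up vector, the 1e-5 boundary is taken from the text above, and the group labels are assumptions introduced here.

```r
# Hypothetical values: two placeholders, four real values, one negative value
toy_values <- c(1e-06, 1e-06, 0.5, 4, 17, 58, -3)

# Intervals are right-closed: (-Inf, 0], (0, 1e-5], (1e-5, Inf)
groups <- cut(
  toy_values,
  breaks = c(-Inf, 0, 1e-5, Inf),
  labels = c("negative_or_zero", "placeholder", "real")
)

# Number of observations per group
group_counts <- table(groups)
group_counts
```

On the real data, `toy_values` would be replaced by `tstrade$Value`.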

# Keep only values above 3; this drops the placeholders and the negative values.
# The threshold of 3 can be increased to get a smaller, more relevant dataset.
tstrade2 = subset(tstrade, tstrade$Value > 3)
nrow(tstrade2)
## [1] 767479
summary(tstrade2$Value)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0     7.0    17.0   145.1    58.0 95010.0
boxplot(tstrade2$Value)

hist(log(log(tstrade2$Value)))
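The double logarithm compresses the long right tail, but the resulting axis is hard to read. One possible alternative, shown only as a sketch, is a base-10 logarithm with one bin per order of magnitude. `toy_filtered` is a hypothetical stand-in for `tstrade2$Value`, and the break points are an assumption chosen to cover values up to 1e5.

```r
# Hypothetical stand-in for tstrade2$Value (all values are above 3)
toy_filtered <- c(4, 7, 17, 58, 145, 95010)

# One bin per order of magnitude: (1, 10], (10, 100], ..., (1e4, 1e5]
# plot = FALSE returns the bin counts without drawing the histogram
h <- hist(log10(toy_filtered), breaks = 0:5, plot = FALSE)

# Label each count with the upper bound of its bin on the original scale
magnitude_counts <- setNames(h$counts, paste0("<=1e", 1:5))
magnitude_counts
```

Dropping `plot = FALSE` draws the histogram in the same way as the calls above.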