This document describes some explorations of the global trade data (data set tstrade.csv) provided by the University of Indiana for the client project “Global Trade Visualization Tool” of Purdue University.
Load the data:
tstrade <- read.csv("~/Desktop/IVMOOC/FinalProject/tstrade.csv")
Obtain basic data information:
str(tstrade)
## 'data.frame': 15352380 obs. of 5 variables:
## $ TRAD_COMM: Factor w/ 57 levels "atp","b_t","c_b",..: 43 54 18 51 37 3 44 29 9 27 ...
## $ REG : Factor w/ 134 levels "alb","are","arg",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ REG.1 : Factor w/ 134 levels "alb","are","arg",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ YEAR : Factor w/ 15 levels "Y1995","Y1996",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Value : num 1e-06 1e-06 1e-06 1e-06 1e-06 ...
unique(tstrade$YEAR)
## [1] Y1995 Y1996 Y1997 Y1998 Y1999 Y2000 Y2001 Y2002 Y2003 Y2004 Y2005
## [12] Y2006 Y2007 Y2008 Y2009
## 15 Levels: Y1995 Y1996 Y1997 Y1998 Y1999 Y2000 Y2001 Y2002 Y2003 ... Y2009
There are about 15.4 million observations of 5 variables, covering 57 traded commodities and 134 regions from 1995 to 2009. This is a full matrix across the 4 dimensions: 57 * 134 * 134 * 15 = 15352380.
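The full-matrix observation can be checked programmatically: the product of the level counts of the four key columns should equal the number of rows, and no key combination should occur twice. The following is a minimal sketch on a small made-up data frame (the `toy` object and its contents are illustrative and not taken from tstrade.csv); the same two checks could be applied to tstrade.

```r
# Toy stand-in for tstrade: every combination of 3 commodities,
# 2 REG values, 2 REG.1 values and 2 years (made-up data).
toy <- expand.grid(
  TRAD_COMM = c("atp", "b_t", "c_b"),
  REG       = c("alb", "are"),
  REG.1     = c("alb", "are"),
  YEAR      = c("Y1995", "Y1996"),
  stringsAsFactors = TRUE
)
toy$Value <- 1e-06

keys <- c("TRAD_COMM", "REG", "REG.1", "YEAR")

# Number of rows a complete matrix would have: product of the level counts
expected_rows <- prod(sapply(toy[keys], nlevels))

# Complete if the row count matches and no key combination is repeated
is_full <- nrow(toy) == expected_rows && !any(duplicated(toy[keys]))
is_full
```

Note that `duplicated()` on all 15 million rows of tstrade may need a noticeable amount of memory and time.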
Question: There are two REG variables (REG and REG.1). Are these the exporting and importing regions, in that order?
Question: What is the definition of the Value variable? Is it the export of TRAD_COMM from REG, and in which unit (USD, mUSD)?
summary(tstrade)
## TRAD_COMM REG REG.1
## atp : 269340 alb : 114570 alb : 114570
## b_t : 269340 are : 114570 are : 114570
## c_b : 269340 arg : 114570 arg : 114570
## cmn : 269340 arm : 114570 arm : 114570
## cmt : 269340 aus : 114570 aus : 114570
## cns : 269340 aut : 114570 aut : 114570
## (Other):13736340 (Other):14664960 (Other):14664960
## YEAR Value
## Y1995 :1023492 Min. :-3434.00
## Y1996 :1023492 1st Qu.: 0.00
## Y1997 :1023492 Median : 0.00
## Y1998 :1023492 Mean : 7.32
## Y1999 :1023492 3rd Qu.: 0.00
## Y2000 :1023492 Max. :95014.00
## (Other):9211428
There are 74 negative “Value” values. Counted by REG:
negVal = subset(tstrade, tstrade$Value < 0)
subset(table(negVal$REG),table(negVal$REG)>0)
##
## are can hkg usa
## 1 7 43 23
Counted by REG.1:
subset(table(negVal$REG.1),table(negVal$REG.1)>0)
## twn
## 74
Question: Is this related to indirect (re-) trades? Do we have to include them in the visualizations?
Distribution of the “Value” variable
hist(log(tstrade$Value))
## Warning in log(tstrade$Value): NaNs produced
Two main groups can be identified: very small values (< 1E-5) and larger values (> 1E-2). Let’s assume that the very small values are placeholders for zero or not available (NA) data and remove them from the dataset. We also ignore the negative values from now on.
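Before removing anything, it can help to count how many values fall into each group. The sketch below uses a short made-up vector (illustrative only) and the 1E-5 cut-off mentioned above; marking the placeholder-sized and negative values as NA instead of dropping them is shown only as one possible alternative, not as the approach taken in this document.

```r
# Made-up values covering the cases seen in the data (illustrative only)
value <- c(1e-06, 1e-06, 0.5, 4, 17, -3, 95014)

# Label each value: negative, placeholder-sized (< 1e-5), or larger
group <- ifelse(value < 0, "negative",
         ifelse(value < 1e-5, "tiny", "larger"))
group_counts <- table(group)
group_counts

# Possible alternative to dropping rows: set tiny and negative values to NA
value_clean <- ifelse(value < 1e-5, NA, value)
```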
# The threshold of 3 can be increased to get a smaller, more relevant dataset
tstrade2 = subset(tstrade, tstrade$Value > 3)
nrow(tstrade2)
## [1] 767479
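To judge how the choice of threshold affects the size of the dataset, the number of surviving rows can be tabulated for several candidate thresholds. A minimal sketch on made-up values; only the threshold 3 comes from the text, the other thresholds are hypothetical:

```r
# Made-up values (illustrative only, not from tstrade.csv)
value <- c(1e-06, 0.5, 2, 4, 7, 17, 58, 145, 95014)

# Candidate thresholds; 3 is the one used in the text, the others are hypothetical
thresholds <- c(3, 10, 100)

# Number of rows that would survive each threshold
kept <- sapply(thresholds, function(t) sum(value > t))
names(kept) <- thresholds
kept
```

On the real data, `value` would be replaced by `tstrade$Value`.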
summary(tstrade2$Value)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 7.0 17.0 145.1 58.0 95010.0
boxplot(tstrade2$Value)
hist(log(log(tstrade2$Value)))