Last update: Aug 07 12:51:42 PM 2015 CEST
At the prevalidation step we must decide whether to accept data from a specific country for further processing. A country may provide good-quality data for some commodities and inadequate data for others, so we want to estimate the quality differences between the commodities of a country.
Data quality is estimated by the following indicators:
Procedure:
On input:
On output:
Vector of logical values (TRUE where the value is missing, FALSE otherwise)
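# Flag the elements of `indicator` that match any of the codes in `missingCases`.
# Returns a logical vector of the same length as `indicator`.
missingIndicator <- function(indicator, missingCases) {
  indicator %in% missingCases
}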
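A short usage sketch of the function above. The sample quantities are made up, and treating both `NA` and `0` as "missing" is an assumption for illustration only; the actual set of missing codes is not stated here.

```r
# Same definition as above, repeated so the example runs on its own.
missingIndicator <- function(indicator, missingCases) {
  indicator %in% missingCases
}

# Hypothetical quantities; NA and 0 as missing codes are an assumption.
qty <- c(17390, NA, 0, 1271)

isMissing <- missingIndicator(qty, missingCases = c(NA, 0))
isMissing

# The mean of a logical vector is the share of TRUE values,
# i.e. the proportion of missing quantities.
mean(isMissing)
```

Because `%in%` is built on `match()`, an `NA` in `missingCases` matches an `NA` in the data, so no separate `is.na()` call is needed.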
We suggest using the reporter's median unit value.
Input:
##   year reporter partner     hs flow weight    qty qunit       value hs2  hs4    hs6
## 1 2009      100     804 081090    2  17390  17390     8   19964.937  08 0810 081090
## 2 2009      100     430 210320    2    329    329     8    1179.756  21 2103 210320
## 3 2009      100     840 130219    2   1271   1271     8   74006.067  13 1302 130219
## 4 2009      100     616 210390    1 326857 326857     8 1372406.708  21 2103 210390
## 5 2009      100     688 151620    2 239300 239300     8  236708.599  15 1516 151620
## 6 2009      100     380 110510    1   1000   1000     8    2603.026  11 1105 110510
##   hs2 total_rows
## 1  01      20949
## 2  02      78765
## 3  03       5500
## 4  04      88846
## 5  05      18617
## 6  06      35992
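The two tables above suggest that `hs2`, `hs4` and `hs6` are prefixes of the `hs` code and that `total_rows` counts records per `hs2` chapter. A minimal sketch of how such columns and counts could be produced follows; the data frame name `flows` and the object `rowsByHs2` are hypothetical, and this is not necessarily how the tables above were generated.

```r
# HS codes taken from the sample rows shown above; kept as character
# strings so that leading zeros survive.
flows <- data.frame(
  hs = c("081090", "210320", "130219", "210390", "151620", "110510"),
  stringsAsFactors = FALSE
)

# Derive the 2- and 4-digit levels as prefixes of the 6-digit code.
flows$hs2 <- substr(flows$hs, 1, 2)
flows$hs4 <- substr(flows$hs, 1, 4)

# Count the number of rows per 2-digit chapter.
rowsByHs2 <- as.data.frame(
  table(hs2 = flows$hs2),
  responseName = "total_rows",
  stringsAsFactors = FALSE
)
rowsByHs2
```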
We identify which reporters provide data of insufficient quality. First, for every reporter we calculate the proportion of trade flows with missing quantity.
Four countries (Bermuda, Lesotho, Palau and Palestine) did not provide quantities at all in 2011. Countries with no more than 2% of trade flows with missing quantity are excluded from the graph.
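The per-reporter proportion and the 2% cut-off can be sketched as follows. The data are made up, the reporter codes are placeholders, and treating `NA` as the only marker of a missing quantity is an assumption.

```r
# Hypothetical trade flows: reporter 100 has gaps, reporter 200 has none.
flows <- data.frame(
  reporter = c(100, 100, 100, 100, 200, 200),
  qty      = c(17390, NA, 329, NA, 1000, 250)
)

# Assumption: a quantity is "missing" when it is NA.
flows$qtyMissing <- is.na(flows$qty)

# The mean of a logical vector is the proportion of TRUE values.
missingByReporter <- aggregate(qtyMissing ~ reporter, data = flows, FUN = mean)

# Keep only reporters with more than 2% of flows lacking a quantity.
toPlot <- subset(missingByReporter, qtyMissing > 0.02)
toPlot
```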
In the following graph we calculate the proportion of trade flows with missing quantities for every HS heading separately. Some countries show nearly the same proportion of missing quantities across all HS headings. Germany, for example, reports quantities under almost all HS headings; exceptions include headings 50, 52 and 53, but even there the share of missing quantities is close to zero: no more than 2.44%. For the United States, by contrast, some reported headings have up to 30% of quantities missing.
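The same calculation broken down by HS heading might look like the sketch below. The data and reporter codes are invented for illustration, `hs2` is assumed to hold the heading, and `NA` is assumed to mark a missing quantity.

```r
# Hypothetical trade flows for two placeholder reporters.
flows <- data.frame(
  reporter = c(1, 1, 1, 2, 2, 2),
  hs2      = c("50", "50", "08", "08", "08", "21"),
  qty      = c(NA, 120, 300, NA, NA, 45),
  stringsAsFactors = FALSE
)
flows$qtyMissing <- is.na(flows$qty)

# Proportion of missing quantities for each reporter and heading.
missingByHeading <- aggregate(qtyMissing ~ reporter + hs2,
                              data = flows, FUN = mean)
missingByHeading

# Worst heading per reporter: a quick way to spot reporters whose
# quality varies strongly between headings.
worstHeading <- tapply(missingByHeading$qtyMissing,
                       missingByHeading$reporter, max)
worstHeading
```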
It was shown earlier that the median is a suitable measure of central tendency for the unit value distribution. Of the global and the reporter median unit value, the reporter median is the better choice. To compare reporters by the number of outliers, we calculate for each reporter the median relative difference between the unit value of a trade flow and the reporter's median unit value for that commodity:
\[ \mathrm{Me}_{\text{reporter}} \left[ \frac{x_{\text{trade flow}} - \mathrm{Me}_{\text{commodity}_{\text{reporter}}}}{x_{\text{trade flow}}} \right] \]
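A minimal sketch of the formula above, taken as written (signed differences, divided by the trade flow's own unit value). Several things here are assumptions: the unit value is computed as `value / qty`, a commodity is identified by `hs6`, and the column names `uv`, `uvMedian` and `relDiff` are hypothetical. The data are made up.

```r
# Hypothetical trade flows of one reporter in two commodities.
flows <- data.frame(
  reporter = c(100, 100, 100, 100),
  hs6      = c("081090", "081090", "210320", "210320"),
  value    = c(10, 30, 20, 60),
  qty      = c(10, 10, 10, 10),
  stringsAsFactors = FALSE
)

# Unit value of every trade flow (assumed to be value per unit of quantity).
flows$uv <- flows$value / flows$qty

# Median unit value of each commodity within each reporter,
# repeated on every row of the corresponding group.
flows$uvMedian <- ave(flows$uv, flows$reporter, flows$hs6, FUN = median)

# Relative difference between a flow's unit value and that median.
flows$relDiff <- (flows$uv - flows$uvMedian) / flows$uv

# Median of the relative differences for each reporter.
medianRelDiff <- aggregate(relDiff ~ reporter, data = flows, FUN = median)
medianRelDiff
```

Using `ave()` keeps the group median aligned with the original rows, so the relative difference can be computed without a merge.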