Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.
library(readr) #did not work with world bank data
library(WDI) #the official world bank api however can properly interpret the csv file.
library(dplyr)
library(lubridate)
library(outliers)
World trade data for metal imports, metal exports, merchandise imports and merchandise exports are imported and preprocessed in this report. The datasets are unsuccesfuly imported with csv reader programs and instead imported using the WDI api. The data is then combined using the merge function with anchor points being set as country and year. THe data is then investigated to see what datatypes it contains and then some variables datatype are transformed. The data is then judged as in the tidy format.New variables that agglomerate the trade data as net exports and relative net metal exports are mutated. THe dataset was scanned and foudn to contain close to half a set with missing values, which were dropped in cases of totaly empty metal trade data and filled in terms of partial trade data. THe data was also scanned for outliers, multivariate outliers were identified but retained.
The World Bank is a valuable resource for researchers looking for macroeconomic/ social data. Metal commodity trade is an important part of a societies economic profile and the records kept for it for close to a centuries worth of data is available for any given country from the World Bank are contained under the terms TM.VAL.MMTL.ZS.UN for metal imports and TX.VAL.MMTL.ZS.UN for metal exports. These variables represent the percentages of a nations annual merchandise export and import that is of metal: ore or material. Thus, the terms TM.VAL.MRCH.CD.WT and TX.VAL.MRCH.CD.WT which represent annual merchandise exports and imports in terms of Us dollars are also relevant if a user wishes to access net metal trade as a percentage of exports or imports. In this report this trade data is downloaded and pre-processed so that it could possibly be used for further commodity trade analysis. Datasets for metal exports and imports as well as merchandise exports and imports are combined, mutated and then scanned for outliers and missing values. The datasets were not succesfully able to retain numeric data when the csv files were read by readr instead the WDI api was propperly capable at importing the data.
sources of data: https://data.worldbank.org/indicator/TX.VAL.MMTL.ZS.UN https://data.worldbank.org/indicator/TM.VAL.MMTL.ZS.UN https://data.worldbank.org/indicator/TX.VAL.MRCH.ZS.UN https://data.worldbank.org/indicator/TM.VAL.MRCH.ZS.UN
imports <- read_csv(file = "API_TM.VAL.MMTL.ZS.UN_DS2_en_csv_v2_10576723.csv")
Missing column names filled in: 'X3' [3]Parsed with column specification:
cols(
`Data Source` = [31mcol_character()[39m,
`World Development Indicators` = [31mcol_character()[39m,
X3 = [31mcol_character()[39m
)
265 parsing failures.
row col expected actual file
2 -- 3 columns 64 columns 'API_TM.VAL.MMTL.ZS.UN_DS2_en_csv_v2_10576723.csv'
3 -- 3 columns 64 columns 'API_TM.VAL.MMTL.ZS.UN_DS2_en_csv_v2_10576723.csv'
4 -- 3 columns 64 columns 'API_TM.VAL.MMTL.ZS.UN_DS2_en_csv_v2_10576723.csv'
5 -- 3 columns 64 columns 'API_TM.VAL.MMTL.ZS.UN_DS2_en_csv_v2_10576723.csv'
6 -- 3 columns 64 columns 'API_TM.VAL.MMTL.ZS.UN_DS2_en_csv_v2_10576723.csv'
... ... ......... .......... ..................................................
See problems(...) for more details.
#read_csv and read.csv fail to intake numerica data
remove(imports)
metal_imports <- WDI(country = "all", indicator="TM.VAL.MMTL.ZS.UN", start = NULL, end = NULL, extra = TRUE, cache = NULL)
metal_exports <- WDI(country = "all", indicator = "TX.VAL.MMTL.ZS.UN", start = NULL, end = NULL, extra = FALSE, cache = NULL)
merch_exports <- WDI(country = "all", indicator = "TX.VAL.MRCH.CD.WT", start = NULL, end = NULL, extra = FALSE, cache = NULL)
merch_imports <- WDI(country = "all", indicator = "TM.VAL.MRCH.CD.WT", start = NULL, end = NULL, extra = FALSE, cache = NULL)
#the wdi program properly imports the dataset
joining the imports to exports
metal_trade <- merge(metal_imports, metal_exports, by=c("iso2c", "country", "year"))
metal_trade<- metal_trade[c(1:4,12,5:11)]
merch_trade <- merge(merch_imports, merch_exports, by=c("iso2c", "country", "year"))
#rearranging the columns
remove(metal_imports)
remove(metal_exports)
remove(merch_exports)
remove(merch_imports)
#dropping tables no longer used
metal_trade<-metal_trade %>% rename(metal_imports = TM.VAL.MMTL.ZS.UN, metal_exports = TX.VAL.MMTL.ZS.UN )
merch_trade <- merch_trade %>% rename(merch_imports = TM.VAL.MRCH.CD.WT, merch_exports = TX.VAL.MRCH.CD.WT)
metal_trade_net <- merge(metal_trade, merch_trade, by = c("iso2c", "country", "year"))
#renaming the confusing labels for the indicators
#rearranging columns
metal_trade_net <- metal_trade_net[c(1:5,13:14,6:12)]
#dropping tables no longer used
remove(metal_trade)
remove(merch_trade)
head(metal_trade_net)
THis dataset is held within a dataframe datastructure. The dataset holds the identifing variables of country and short hand regions identifier: iso2c, these varaibles are strings. The variables for imports and exports are numerical data. The variable for year is also numeric but changed by the lubridate module to be a date datatype. The extra data is all held as factor data which includes region, capital, longitude, latitude, income and lending capacity. The variable for income is changed to be an ordered factor.
metal_trade_dt <- str(metal_trade_net)
'data.frame': 15576 obs. of 14 variables:
$ iso2c : chr "1A" "1A" "1A" "1A" ...
$ country : chr "Arab World" "Arab World" "Arab World" "Arab World" ...
$ year : int 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ...
$ metal_imports: num NA NA NA NA NA NA NA NA NA NA ...
$ metal_exports: num NA NA NA NA NA NA NA NA NA NA ...
$ merch_imports: num 4.86e+09 4.86e+09 4.45e+09 5.11e+09 5.51e+09 ...
$ merch_exports: num 4.66e+09 4.60e+09 4.86e+09 6.12e+09 6.81e+09 ...
$ iso3c : Factor w/ 304 levels "ABW","AFG","AFR",..: 8 8 8 8 8 8 8 8 8 8 ...
$ region : Factor w/ 8 levels "Aggregates","East Asia & Pacific",..: 1 1 1 1 1 1 1 1 1 1 ...
$ capital : Factor w/ 212 levels "","Abu Dhabi",..: 1 1 1 1 1 1 1 1 1 1 ...
$ longitude : Factor w/ 212 levels "","-0.126236",..: 1 1 1 1 1 1 1 1 1 1 ...
$ latitude : Factor w/ 212 levels "","-0.229498",..: 1 1 1 1 1 1 1 1 1 1 ...
$ income : Factor w/ 5 levels "Aggregates","High income",..: 1 1 1 1 1 1 1 1 1 1 ...
$ lending : Factor w/ 5 levels "Aggregates","Blend",..: 1 1 1 1 1 1 1 1 1 1 ...
metal_trade_dt
NULL
#there are string data, numeral data and factor data
#there are no date data even though the year represents a date.
metal_trade_net$year <- ymd(metal_trade_net$year, truncated = 2L)
cat(class(metal_trade_net$year)," : ", toString(head(metal_trade_net$year)))
Date : 1960-01-01, 1961-01-01, 1962-01-01, 1963-01-01, 1964-01-01, 1965-01-01
#years variable changed to date datatype
attributes(metal_trade_net$income)
$levels
[1] "Aggregates" "High income" "Low income" "Lower middle income"
[5] "Upper middle income"
$class
[1] "factor"
metal_trade_net$income <- ordered(metal_trade_net$income, levels = c("Aggregates", "Low income", "Lower middle income", "Upper middle income", "High income"))
tail(metal_trade_net$income)
[1] Low income Low income Low income Low income Low income Low income
Levels: Aggregates < Low income < Lower middle income < Upper middle income < High income
#ordering applied to factor data
This dataset is perceived as Tidy which is not a surprise given that it is from the World bank. For a dataset to be tidy it needs to have all it’s variables in columns, all its observations in specific rows and each value having it’s own cell in the dataframe. The third ruke for tidy datasets that each cell only have it’s own value is checked using the group by, sumarise and distinct value count functions from the dplyr module.
#income and lending variables curiously may be includding multiple variables within the same dataframe
metal_duplic_test<- metal_trade_net %>% group_by(country, year) %>% summarise(n = n())
metal_duplic_test <- metal_duplic_test[order(metal_duplic_test$n, decreasing = TRUE),]
metal_duplic_test %>% group_by(n) %>% summarise(distinct_values = n_distinct(n))
It is seen that each instance of a country and year are individual rows with unique values. Therefor it is confirmed that the dataset is in the tidy format.
With the collection of imported metals and exported metals the value for net exported metals can be found.This requires the inclusion of total merchandise imports and merchandise exports so that the metal net exports relative to merchandise exports may be found.
the following formulas are surmized to present net metal exports relative to net merchandise exports. This is to replicate the standard of the world bank to present specific trade of ore and metal as a percentage of a nations annual whole exports or whole imports
net metal export: mmtl NX = (metal_export x merch_export)/100-(metal import * merch_import)/100 net merchandise export: merch NX = (merch_export)-(merch_import) relative net metal export: mmtl RNX =mmtl NX / merch_exports
The number of missing values in each collumn is perceived using the colsums function. Then the number of rows with fully missing trade data or partially mising trade data is checked using the sum and is.na functions. THe fully missing metal trade data is dropped, the partially missing trade data is filled as 0. Cases like missing values for the extra data like capital or income are dropped as is cases where the merchandise trade data is not pressent. Then the dependant variables like net exports are recalculated.
colSums(is.na(metal_trade_net))
iso2c country year metal_imports metal_exports metal_rnx merch_imports merch_exports
0 0 0 6941 7114 7378 2836 2784
merch_nx metal_nx iso3c region capital longitude latitude income
2836 7378 236 236 236 236 236 236
lending
236
the majority of data with missing values stems from the independant values of metal imports/exports
paste("The number of observations without metal trade data: ",
sum(is.na(metal_trade_net$metal_imports & metal_trade_net$metal_exports)))
[1] "The number of observations without metal trade data: 7256"
paste("The number of observations without trade data: ",
sum(is.na(metal_trade_net$metal_imports & metal_trade_net$metal_exports & metal_trade_net$merch_exports & metal_trade_net$merch_imports)))
[1] "The number of observations without trade data: 7368"
paste("the number of observations with metal import data but not metal export data: ",
count(metal_trade_net %>% select(metal_imports, metal_exports) %>% filter(is.na(metal_exports), !is.na(metal_imports)) ))
[1] "the number of observations with metal import data but not metal export data: 315"
paste ("the number of observations with metal export data but not metal import data: ",
count(metal_trade_net %>% select(metal_imports, metal_exports) %>% filter(is.na(metal_imports), !is.na(metal_exports)) ))
[1] "the number of observations with metal export data but not metal import data: 142"
paste("total number of observations: ",
metal_trade_net %>% count())
[1] "total number of observations: 15576"
7256/15576 tuples are missing either export or import data. So about half. 6941/15576 hold neither metal import or export data. 7368/15576 tuples hold no data for all trade variables.
apart from instances where metal trade data is present in 1 dimension, the rest of the missing data must be dropped. 1 dimensional metal trade data can have null values be set to 0. THis is based on the assumption that there was zero occurence of the non tracked data.
metal_trade_net <- metal_trade_net %>% filter((!is.na(metal_exports)&!is.na(metal_imports))
|(!is.na(metal_exports)&is.na(metal_imports))
|is.na(metal_exports)&!is.na(metal_imports))
#data without any metal trade data has been dropped
metal_trade_net$metal_imports[is.na(metal_trade_net$metal_imports)] <-0
metal_trade_net$metal_exports[is.na(metal_trade_net$metal_exports)] <-0
#data with 1 metal trade factor is mutated
colSums(is.na(metal_trade_net ))
iso2c country year metal_imports metal_exports metal_rnx merch_imports merch_exports
0 0 0 0 0 579 130 130
merch_nx metal_nx iso3c region capital longitude latitude income
130 579 108 108 108 108 108 108
lending
108
The merchandise trade data is still filled with nulls
metal_trade_net %>% select(year, country, metal_exports, metal_rnx, merch_imports, merch_exports) %>% filter(is.na(merch_imports)|is.na(merch_exports))
it is apparent that data without merch trade values will not be able to determine metal_rnx. they only represent 130 values. they can be dropped without impacting the dataset.
metal_trade_net<- metal_trade_net %>% filter(!is.na(merch_imports), !is.na(merch_exports))
#rows without extra information are dropped as well
metal_trade_net <- metal_trade_net %>% filter(!is.na(income))
colSums(is.na(metal_trade_net ))
iso2c country year metal_imports metal_exports metal_rnx merch_imports merch_exports
0 0 0 0 0 445 0 0
merch_nx metal_nx iso3c region capital longitude latitude income
0 445 0 0 0 0 0 0
lending
0
The dependant mutated data of {metal_rnx, merch_nx, metal_nx} needs to be refreshed
metal_trade_net<-within(metal_trade_net, metal_nx <-(metal_exports*merch_exports)/100-(metal_imports * merch_imports)/100)
metal_trade_net <-within(metal_trade_net, merch_nx <- merch_exports-merch_imports)
metal_trade_net <-within(metal_trade_net, metal_rnx <- (metal_nx/merch_exports)*100)
colSums(is.na(metal_trade_net))
iso2c country year metal_imports metal_exports metal_rnx merch_imports merch_exports
0 0 0 0 0 0 0 0
merch_nx metal_nx iso3c region capital longitude latitude income
0 0 0 0 0 0 0 0
lending
0
##all missing values have been dealt with.
paste("the number of rows in the updated dataset is: ", count(metal_trade_net) )
[1] "the number of rows in the updated dataset is: 8539"
In this step the data that is an outlier as a univariate of a trade variable is viszualised. THe outliers for multivariate combinations of trade variables and income level or year or country are also visualiszed. It is recommended that outliers be deffined based on a multivariate basis with country or year. Outliers have not been changed.
metal_trade_net_sub <- metal_trade_net %>% select(country, year, metal_imports, metal_exports, metal_rnx, income)
merch_trade_net_sub <- metal_trade_net %>% select(country, year, merch_imports, merch_exports, merch_nx, metal_nx, income)
summary(metal_trade_net_sub %>% select(-year, -country, -income))
metal_imports metal_exports metal_rnx
Min. : 0.000 Min. : 0.0000 Min. : -Inf
1st Qu.: 1.074 1st Qu.: 0.7118 1st Qu.: -1.7875
Median : 2.034 Median : 2.9284 Median : -0.4374
Mean : 2.620 Mean : 7.7990 Mean : -Inf
3rd Qu.: 3.578 3rd Qu.: 6.8281 3rd Qu.: 3.3954
Max. :40.369 Max. :158.4931 Max. :156.7871
metal_trade_net_sub %>% select(-year, -country, -income) %>% boxplot(main = "Box Plot of metal trade data", ylab = "% of exports or imports")
summary(merch_trade_net_sub %>% select(-year, -country, - income))
merch_imports merch_exports merch_nx metal_nx
Min. :3.000e+06 Min. :0.000e+00 Min. :-1.081e+12 Min. :-2.304e+11
1st Qu.:1.041e+09 1st Qu.:6.815e+08 1st Qu.:-2.800e+09 1st Qu.:-1.592e+08
Median :7.125e+09 Median :5.664e+09 Median :-2.824e+08 Median :-1.710e+06
Mean :2.938e+11 Mean :2.919e+11 Mean :-1.897e+09 Mean :-9.251e+08
3rd Qu.:6.624e+10 3rd Qu.:6.777e+10 3rd Qu.: 3.318e+08 3rd Qu.: 1.708e+08
Max. :1.912e+13 Max. :1.911e+13 Max. : 8.754e+11 Max. : 1.249e+11
merch_trade_net_sub %>% select(-year, -country, - income) %>% boxplot( main = "Box Plot of merchandise trade data", ylab = "USD")
boxplot(metal_trade_net_sub$metal_exports ~ metal_trade_net_sub$country)
boxplot(metal_trade_net_sub$metal_imports ~ metal_trade_net_sub$country)
#the majority of data falls within a certain banding.
boxplot(metal_trade_net_sub$metal_rnx ~ metal_trade_net_sub$country)
boxplot(metal_trade_net_sub$metal_exports ~ metal_trade_net_sub$income)
boxplot(metal_trade_net_sub$metal_imports ~ metal_trade_net_sub$income)
boxplot(metal_trade_net_sub$metal_imports ~ metal_trade_net_sub$year)
It is apparent that the outliers for the univariate trade data can be dropped as can the multivariate outliers for trade data specific to country and trade data specific to year. Thease have clear outliers to remove. Capping would be the best way to manage this.