Required packages

library(readr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(outliers)

Executive Summary

This report explains one of the major stake in Data Analysis which is Data Preprocessing. Hence the report adheres to the Data preprocessing framework which is defined by five major tasks i.e Get, Understand, Tidy & Manipulate, Scan and Transform. Three datasets were used, which are obtained from the online repository of The World Bank.The data is concised to select only the relevant information, those which deemed suffcient for the exercise.Each data set is inspected for missing values and rows which had more than two missing values are excluded. All the three datasets are merged and the variables accumulated are converted to appropriate datatypes.

The variables in the data are transformed to produce new variable and the available numeric variables are inspected for outliers and corrected accordingly.The important numeric variables in the dataset is exposed to logarithmic transformation to reduce the skewness which would help achieve better statistical results going forward.

Data

The three datasets used in the analysis are all potential indicators of GDP(in currents US$) , Exports of goods and servcies(in % of GDP) and metadata of income group for the countries assesed by world bank. The datatsets contain data from year 1960 to 2019, thus for the purpose of analysis only variables from 2009 to 2018 along with the indicators are selected to consider the countries latest progress.

The dataset of GDP per capita eplains the GDP of countries evaluated in current value of US$ , obtained from https://data.worldbank.org/indicator/NY.GDP.PCAP.CD?end=2018&start=1960&view=chart.

The variables are -

-Country: The name of countries

-Country Code: A three letter code representing the country name

-Indicator Name: Variable name of that row (GDP per capita (current US$))

-Indicator Code: Internal code for indicator name

-1960:2019: Year, column contains the GDP’s in USD

GDP <- read_csv("Project/GDP_CAP.csv", skip = 3)
## Warning: Missing column names filled in: 'X65' [65]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   `2019` = col_logical(),
##   X65 = col_logical()
## )
## See spec(...) for full column specifications.
GDP_Ten <- GDP[,c(1,2,54:63)]
head(GDP)

The dataset of Exports(in % GDP) explains the countries total export of goods and services in percentage of their GDP, obtained from https://data.worldbank.org/indicator/NE.EXP.GNFS.ZS

The variables are

-Country: The name of countries

-Country Code: A three letter code representing the country name

-Indicator Name: Variable name of that row (Exports of goods and services (% of GDP))

-Indicator Code: Internal code for indicator name

-1960:2019: Year, column contains the percentage of GDP in Exports’s

EXP <- read_csv("Project/EXP.csv", skip = 3)
## Warning: Missing column names filled in: 'X65' [65]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   `2019` = col_logical(),
##   X65 = col_logical()
## )
## See spec(...) for full column specifications.
EXP_Ten <-  EXP[,c(1,2,54:63)]
head(EXP)

The third dataset contains the details of Income groups of countries which was available as metadata with both the above downloaded data.For the further process only Country code and IncomeGroup are selected which deemed to be relevant for comparison with other tables.

The variables are -

-Country Code: A Three letter code representing the country name

-Region: Region of the country in terms of continent

-IncomeGroup: Income group of that country

-SpecialNote: Internal notes

-TableName: The name of countries

Country <-  read_csv("Project/Country.csv")
## Warning: Missing column names filled in: 'X6' [6]
## Parsed with column specification:
## cols(
##   `Country Code` = col_character(),
##   Region = col_character(),
##   IncomeGroup = col_character(),
##   SpecialNotes = col_character(),
##   TableName = col_character(),
##   X6 = col_logical()
## )
Country_income <- Country[,c(1,3)]
head(Country)

The rows are then inspected for types of missing values in variables using the unique function for variables GDP_Ten and EXP_Ten which form the basis of our analysis.

unique(GDP_Ten)
unique(EXP_Ten)

It is observed that the missing values are represented by NA.Thus the rows which have more than 2 NA values are removed since they are missing out on data in a greater extent.The Income Group variable is not manipulated since its a grouping factor for the countries and removal of rows or imputing the data cannot be justified.

GDP_NA <- GDP_Ten[rowSums(is.na(GDP_Ten)) <2,]
EXP_NA <- EXP_Ten[rowSums(is.na(EXP_Ten)) <2,]

Tidy & Manipulate Data I

Both datasets above does not follow the tidy data principles (Hadley Wickham and Grolemund (2016)) format as they are currently in the wide format. The years from both datasets will be gathered into a column to transform the data into the long form so that each observation can form a row to abide by the principles.

GDP_gather <- GDP_NA %>% gather(c(3:12),key = "Year",value = "GDP_USD" )
head(GDP_gather)
EXP_gather <- EXP_NA %>% gather(c(3:12),key = "Year", value = "EXP_PER_GDP(%)")
head(EXP_gather)

The gathered datasets of GDP and Export are merged using innerjoin while the merged dataset is joined with Country_income dataset using left join. The fully merged dataset is named as Joined_data. Merging the datasets together would make the comparison between variables from different dataset much easier.

GDP_Inf <- EXP_gather %>% inner_join(GDP_gather, by = c("Country Name","Country Code","Year"))

Joined_data <- GDP_Inf %>% left_join(Country_income, by = "Country Code")

head(Joined_data)

Understand

The Joined_data dataset consists of six variables out of which four are of character datatype and two are of numeric datatypes. For the application of statistical procedures each variable is very important to achieve efficient test results. The datatype of each variable is given using the structure function.

str(Joined_data)
## tibble [2,200 x 6] (S3: tbl_df/tbl/data.frame)
##  $ Country Name  : chr [1:2200] "Aruba" "Afghanistan" "Angola" "Albania" ...
##  $ Country Code  : chr [1:2200] "ABW" "AFG" "AGO" "ALB" ...
##  $ Year          : chr [1:2200] "2009" "2009" "2009" "2009" ...
##  $ EXP_PER_GDP(%): num [1:2200] 63.8 22.7 58.8 25.2 47.7 ...
##  $ GDP_USD       : num [1:2200] 24630 438 3123 4114 5183 ...
##  $ IncomeGroup   : chr [1:2200] "High income" "Low income" "Lower middle income" "Upper middle income" ...
typeof(Joined_data$`EXP_PER_GDP(%)`)
## [1] "double"
typeof(Joined_data$GDP_USD)
## [1] "double"

The varibales EXP_PER_GDP(%) & GDP_USD are of number class and they are of the correct datatype, double.The Country Name, Country Code should be of factor class while Year, and IncomeGroup should be ordinal factor class.Thus the variables are converted to their appropriate datatypes.

Income group variable is factored and levelled to categorize from Low - High income groups.Similarly Year variable is aso factored and levelled to categorize the years from 2009 - 2018. Country Name & Country Code are coverted as factors using the lapply function for the ease of handling.

Joined_data$IncomeGroup <- factor(Joined_data$IncomeGroup, levels = c("Low income", "Lower middle income", "Middle income", "Upper middle income", "High income"), ordered = TRUE)

Joined_data$Year <- factor(Joined_data$Year, levels = c(2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018), ordered = TRUE)

Joined_data[,c(1:2)] <- lapply(Joined_data[,c(1:2)],FUN = as.factor)

str(Joined_data)
## tibble [2,200 x 6] (S3: tbl_df/tbl/data.frame)
##  $ Country Name  : Factor w/ 220 levels "Afghanistan",..: 9 1 5 2 6 209 7 8 4 10 ...
##  $ Country Code  : Factor w/ 220 levels "ABW","AFG","AGO",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Year          : Ord.factor w/ 10 levels "2009"<"2010"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ EXP_PER_GDP(%): num [1:2200] 63.8 22.7 58.8 25.2 47.7 ...
##  $ GDP_USD       : num [1:2200] 24630 438 3123 4114 5183 ...
##  $ IncomeGroup   : Ord.factor w/ 5 levels "Low income"<"Lower middle income"<..: 5 1 2 4 NA 5 4 4 4 5 ...

Scan I

The variables in the dataset is scanned for missing values using the colSums() function as it is discovered during the iniatial scan that the datasets only contains NA values.Hence there no special values or obvious errors available.The colSum() function gives the number of missing values for each variable which aids in the determination of methodologies, to deal with them.

colSums(is.na(Joined_data))
##   Country Name   Country Code           Year EXP_PER_GDP(%)        GDP_USD 
##              0              0              0             15              5 
##    IncomeGroup 
##            420

It can be observed that GDP_USD and EXP_GDP contains missing values. Income group variable is avoided since its a factor variable and represents The mean for GDP_USD and EXP_PER_GDP(%) and the count of NA value is found for each country using group_by() function followed by summarise() function.

Joined_data %>%  group_by(`Country Name`) %>% summarise(Mean = mean(GDP_USD, na.rm = TRUE), Missing = sum(is.na(GDP_USD)))
Joined_data %>%  group_by(`Country Name`) %>% summarise(Mean = mean(`EXP_PER_GDP(%)`, na.rm = TRUE), Missing = sum(is.na(`EXP_PER_GDP(%)`)))

The NA values of each country will be replaced by the mean of that country using impute() function inside mutate(). Then the NA values for each variable is verified again.

Joined_one <-  Joined_data %>%  group_by(`Country Name`) %>% mutate(GDP_USD = impute(GDP_USD, fun = mean) ) %>% ungroup()
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes
group_by(Joined_one, `Country Name`) %>% summarise(n = n(), Missing = sum(is.na(GDP_USD)))
Joined_one <-  Joined_one %>%  group_by(`Country Name`) %>% mutate(`EXP_PER_GDP(%)` = impute(`EXP_PER_GDP(%)`, fun = mean) ) %>% ungroup()
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes

## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes
group_by(Joined_one, `Country Name`) %>% summarise(n = n(), Missing = sum(is.na(`EXP_PER_GDP(%)`)))

Tidy & Manipulate Data II

Since there are no accountable missing values another column Exports_USD can be created from EXP_PER_GDP(%) and GDP_USD values. The Exports_USD will give the annual value in Exports of goods and services in current USD, for the countries. The columns in new table are re-ordered using select() function.

Joined_new <- Joined_one %>% mutate(Exports_USD = (`EXP_PER_GDP(%)`/100)*GDP_USD)

Joined_new <- select(Joined_new, c(1,2,6,3,4,7,5))
head(Joined_new)

Scan II

Delaing with outliers is a crucial step of scanning process since the characteristics of results are directly related to outliers.Greater outliers can widely bias the results and hence should be avoided. The outliers in GDP_USD and Exports_USD are scanned using boxplot.According to the Tukey’s method of outlier detection, outliers are defined as the values in the data set that fall below −1.5×IQR or above 1.5×IQR.

Box <- Joined_new$GDP_USD %>% boxplot(main="Boxplot of GDP_USD with outliers", ylab = " GDP_USD", col = "SkyBlue" )

As there are many outliers in GDP_USD which needs to be treated, capping is done to adjust the outliers. Capping replaces the outliers with the nearest neighbours that are not outliers.In order to apply capping on the outliers,a user-defined function cap(taken from: Stackoverflow) is used.

cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

The treated data is inspected again by using boxplot.

Joined_new$GDP_USD <-  Joined_new$GDP_USD %>% cap()
Joined_new$GDP_USD %>% boxplot(main="Boxplot of GDP_USD without outliers", ylab = " GDP_USD", col = "SkyBlue")

GDP_USD is left with a single outlier which can be neglected.Similarly Exports_USD variable is also capped to adjust the outliers.

Box_two <- Joined_new$Exports_USD %>% boxplot(main="Boxplot of Exports_USD with outliers", ylab = " Exports_USD", col = "SkyBlue")

Joined_new$Exports_USD <-  Joined_new$Exports_USD %>% cap()
Joined_new$Exports_USD %>% boxplot(main="Boxplot of Exports_USD without outliers", ylab = " Exports_USD", col = "SkyBlue")

## Transform

Transforming the relevant variables is essential for statistical analysis for the application of statistical techniques as most of them prefer normal distribution over skewed distribution.The skewness of the variable can be checked using histograms.

hist(Joined_new$GDP_USD, main = "Histogram of GDP_USD", xlab = "GDP in USD", col = "SkyBlue")

It is observed that the histogram for GDP_USD is right-skewed, which is corrected by using the Log tranformation(log10) to make the distribution more symmetrical.

hist(Joined_new$GDP_USD, main = "Histogram of GDP_USD", xlab = "GDP_USD",, col = "SkyBlue")

hist(log10(Joined_new$GDP_USD) , main = "Histogram of log10(GDP_USD)",xlab = "log10(GDP_USD)",, col = "SkyBlue")

Similar logarithmic transformation approach is followed for EXP_USD.

hist(Joined_new$Exports_USD, main = "Histogram of Exports_USD",xlab = "Exports_USD",, col = "SkyBlue")

hist(log10(Joined_new$Exports_USD), main = "Histogram of log10(Exports_USD)", xlab = "log10(Exports_USD)",, col = "SkyBlue")