# Output message of this part is hidden to reduce the length of the report
library(readxl)
library(stringr)
library(Hmisc)
library(editrules)
library(forecast)
library(lubridate)
library(dplyr)
library(tidyr)
The dataset containing development indicators of countries in the years from 1960 to 2020 was drawn from World Bank Open Data.
First, the required libraries were loaded and the dataset was imported and subset for the purpose of preprocessing.
After understanding the variables thoroughly, categorical variables Region and Income Group were converted from characters to factors.
In the dataset, column names of 1960 to 2020 were values of the variable “Year”, while column Indicator Code contains variable names instead of observations. Therefore, this dataset was untidy and was tidied using the tidyr package. The Year variable produced in this process was converted form character to Date. And further subset was performed to focus on some populational indicators in 2019.
Also, a numeric variable Gender_ratio was created and computed in order to compare the male to female ratio in different countries.
Then the dataset was scanned for any missing or special values and obvious inconsistency. Some missing values were found in this scan and dealt with in different ways according to the nature of the variables.
A second scan was performed to check for outliers. The outliers were inspected, but due to the nature of the dataset, they were left untreated.
Finally, log transformations and a Box-Cox transformation was performed on numeric variables for better understanding and illustration of this dataset.
The dataset is drawn from World Bank Open Data(https://datacatalog.worldbank.org/dataset/world-development-indicators). This dataset contains the primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.
This dataset is licensed under CC-BY 4.0, which allows users to copy, modify and distribute data in any format for any purpose.
This dataset comes in the form of a “WDIEXCEL.xlsx”. Only two sheets “Country” and “Data” are used in the preprocessing.
Although this sheet contains 30 variables, only some of them, as listed below, are relevant for the purposes of this preprocessing.
1.Country Code: Unique three-letter codes of a country.
2.Currency Unit: The unit of a country’s official currency.
3.Special Notes: Notable details of the rows.
4.Region: A categorical variable that indicates the geographic region group of a country.
5.Income Group: A categorical variable that indicates income level of a country, divided in 4 income groups.
Other variables are dropped after reading the file.
This sheet contains following variables:
1.Country Name: Name of a country.
2.Country Code: Unique three-letter codes of a country, same as in “Country”.
3.Indicator Name: Name of World Development Indicators.
This variable will be dropped, because after tidying, the corresponding Indicator Code will be used as column names, making this variable superfluous.
4.Indicator Code: Short code of World Development Indicators.
1960-2020: Numeric variables that indicates the value of each indicator for each country in each year.
Note: There are rows for aggregate and indexing purpose in both sheets. These rows will be removed in “Scan I” part.
# Read the file
country <- read_excel("WDIEXCEL.xlsx",sheet ="Country")
data <- read_excel("WDIEXCEL.xlsx",sheet ="Data")
# Check the imported datasets
head(country)
head(data)
# Subset the datasets to remove unneeded variables
country <- country %>% select(`Country Code`, `Short Name`,
`Currency Unit`,`Special Notes`,Region,`Income Group`)
data <- data %>% select(-`Indicator Name`)
# Check the datasets after subsetting
head(country)
head(data)
# Merge the datasets with the common key `Country Code`
# Only two columns in "country" will be joined to "wdi", the purpose of the other columns will be explained in Scan I
wdi <- data %>% left_join(select(country,`Country Code`, Region,`Income Group`),
by="Country Code")
# Check the resulted dataset to be processed further
head(wdi)
The resulted wdi dataset contains 66 variables.
Among them, Country Name and Country Code are character variables with values unique to each country.
Indicator Code are character variables with values unique to each World Development Indicator.
1960 to 2020 are numeric variables containing the values of each World Development Indicator of each country in each year.
Region and Income Group are character variables with values indicating region and income groups, which should be converted to factors.
# Check the structure of wdi, `1960` to `2019`is omitted to reduce the length of the report
str(wdi %>% select(-(`1960`:`2019`)))
## tibble [379,368 x 6] (S3: tbl_df/tbl/data.frame)
## $ Country Name : chr [1:379368] "Arab World" "Arab World" "Arab World" "Arab World" ...
## $ Country Code : chr [1:379368] "ARB" "ARB" "ARB" "ARB" ...
## $ Indicator Code: chr [1:379368] "EG.CFT.ACCS.ZS" "EG.ELC.ACCS.ZS" "EG.ELC.ACCS.RU.ZS" "EG.ELC.ACCS.UR.ZS" ...
## $ 2020 : num [1:379368] NA NA NA NA NA NA NA NA NA NA ...
## $ Region : chr [1:379368] NA NA NA NA ...
## $ Income Group : chr [1:379368] NA NA NA NA ...
# Convert `Region` to factor, `Income Group` to ordered factor
wdi <- wdi %>% mutate( Region=factor(Region),
`Income Group`=factor(`Income Group`,
levels = c("Low income","Lower middle income",
"Upper middle income","High income"),
ordered=TRUE))
# Check the structure of wdi after conversion, `1960` to `2019`is omited to reduce the length of the report
str(wdi %>% select(-(`1960`:`2019`)))
## tibble [379,368 x 6] (S3: tbl_df/tbl/data.frame)
## $ Country Name : chr [1:379368] "Arab World" "Arab World" "Arab World" "Arab World" ...
## $ Country Code : chr [1:379368] "ARB" "ARB" "ARB" "ARB" ...
## $ Indicator Code: chr [1:379368] "EG.CFT.ACCS.ZS" "EG.ELC.ACCS.ZS" "EG.ELC.ACCS.RU.ZS" "EG.ELC.ACCS.UR.ZS" ...
## $ 2020 : num [1:379368] NA NA NA NA NA NA NA NA NA NA ...
## $ Region : Factor w/ 7 levels "East Asia & Pacific",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Income Group : Ord.factor w/ 4 levels "Low income"<"Lower middle income"<..: NA NA NA NA NA NA NA NA NA NA ...
In wdi dataset, column names of 1960 to 2020 are values of the variable “Year”, while column Indicator Code contains variable names instead of observations. Therefore, the tidy data principles “each observation must have its own row” and “each variable must have its own column” are violated. In other words, wdi is untidy.
To tidy this dataset, 1960 to 2020 are gathered intoYear and then Indicator Code is spread in order to form a tidy dataset. The Year variable is then converted form character to date format, and further subset is performed to focus on indicators SP.POP.TOTL( total population of a country ), SP.POP.TOTL.MA.IN( male population of a country ), SP.POP.TOTL.FE.IN( female population of a country ) in the year 2019.
# Gather() and spread() are used to tidy wdi
wdi <- wdi %>% gather(`1960`:`2020`,key="Year",value = "value")%>%
spread(key=`Indicator Code`,value = value)
# Convert characteristics `Year` to Date
wdi$Year <- as.Date(paste(wdi$Year,"01","01"), format="%Y%m%d")
# Further subset wdi in 2019
wdi_2019 <- wdi %>% select(`Country Name`,`Country Code`,Region,`Income Group`,
Year,`SP.POP.TOTL`,`SP.POP.TOTL.MA.IN`,`SP.POP.TOTL.FE.IN`) %>%
filter(year(Year)==2019)
# Check wdi_2019
head(wdi_2019)
In order to compare the male to female ratio in different countries, a numeric variable Gender_ratio is created and computed from male and female population.
# Create and compute `Gender_ratio` column
wdi_2019 <- wdi_2019 %>% mutate(Gender_ratio=SP.POP.TOTL.MA.IN/SP.POP.TOTL.FE.IN)
# Check wdi_2019 after tidy and manipulation
head(wdi_2019)
In order to scan the wdi_2019 dataset for missing and special values ( i.e. NaN, -Inf and Inf ), some functions are created, and then applied to the wdi_2019 dataset.
From the output of the functions, it appears that there are missing values in Region, Income Group and the numeric variables, while no special value presents in the dataset.
Further inspection shows that the missing values in Region, Income Group are caused by the aggregate and indexing rows that are not actual observation which should be excluded anyway.
After excluding the aggregate and indexing rows, only the country Eritrea has missing value in the SP.POP.TOTL( total population of a country ) variable. Thus, this value is simply computed from the difference of total world population and the sum population of other countries in 2019.
Then the missing Gender_ratio values are replaced with the mean Gender_ratio of all the other countries using the Hmisc package. And the missing male and female population are computed accordingly.
Finally, editrules package is used to make sure there is no obvious inconsistency in the dataset. Although due to editrules’s failure to support variables with space, “/” operator or even the multiplication between two variables, only simple rules are applied.
# Function for checking missing and special values.
check_na_special <- function(x){
a <- sum(is.na(x))
b <- sum(is.nan(x))
c <- sum(is.infinite(x))
return(c(a,b,c))
}
# Function for displaying the number and proportion of missing and special values.
num_na_special <- function(x){
print(cbind(c("num_NA","num_NaN","num_Inf"),
sapply(x,check_na_special)))
}
pro_na_special <- function(x){
print(cbind(c("pro_NA","pro_NaN","pro_Inf"),
sapply(x,function(x) check_na_special(x)/length(x))))
}
# Function for printing rows with missing values in specific columns
print_na <- function(df,cols){
vec <- rep(FALSE,times=nrow(df))
for (col in cols) {
for (row in 1:nrow(df)) {
if(!vec[row]){
if(is.na(df[row,col])) {
vec[row] <- TRUE
}
}
}
}
head(df[vec,],n=sum(vec))
}
# Check missing and special values in wdi_2019
wdi_2019 %>% num_na_special()
## Country Name Country Code Region Income Group Year SP.POP.TOTL
## [1,] "num_NA" "0" "0" "47" "47" "0" "2"
## [2,] "num_NaN" "0" "0" "0" "0" "0" "0"
## [3,] "num_Inf" "0" "0" "0" "0" "0" "0"
## SP.POP.TOTL.MA.IN SP.POP.TOTL.FE.IN Gender_ratio
## [1,] "25" "25" "25"
## [2,] "0" "0" "0"
## [3,] "0" "0" "0"
wdi_2019 %>% pro_na_special()
## Country Name Country Code Region
## [1,] "pro_NA" "0" "0" "0.178030303030303"
## [2,] "pro_NaN" "0" "0" "0"
## [3,] "pro_Inf" "0" "0" "0"
## Income Group Year SP.POP.TOTL SP.POP.TOTL.MA.IN
## [1,] "0.178030303030303" "0" "0.00757575757575758" "0.0946969696969697"
## [2,] "0" "0" "0" "0"
## [3,] "0" "0" "0" "0"
## SP.POP.TOTL.FE.IN Gender_ratio
## [1,] "0.0946969696969697" "0.0946969696969697"
## [2,] "0" "0"
## [3,] "0" "0"
# Inspect missing values in Region and `Income Group`
print_na(wdi_2019,c("Region","Income Group"))
# Record the World total population for later computation
world_pop <- wdi_2019$SP.POP.TOTL[wdi_2019$`Country Name`=="World"]
# In order to identify the aggregate rows,the country dataset is scanned
# The aggregate rows are those with the word "aggregate" in `Special Notes` or do not have a `Currency Unit`
# Their `Country Code` and `Short Name` are selected to create a lookup table
aggregate_table <- country %>% filter(str_detect(country$`Special Notes`,pattern="aggregate")
|is.na(country$`Currency Unit`)) %>%
select(`Country Code`,`Short Name`) %>% unique()
# Check the lookup table
print(aggregate_table)
## # A tibble: 46 x 2
## `Country Code` `Short Name`
## <chr> <chr>
## 1 ARB Arab World
## 2 CEB Central Europe and the Baltics
## 3 CSS Caribbean small states
## 4 EAP East Asia & Pacific (excluding high income)
## 5 EAR Early-demographic dividend
## 6 EAS East Asia & Pacific
## 7 ECA Europe & Central Asia (excluding high income)
## 8 ECS Europe & Central Asia
## 9 EMU Euro area
## 10 EUU European Union
## # ... with 36 more rows
# Exclude rows with `Country Code` in the lookup table or with`Country Code` == "INX" which indicates the indexing rows
wdi_2019 <- wdi_2019 %>% filter(!(`Country Code` %in% aggregate_table$`Country Code`|`Country Code` == "INX") )
# Check the number of missing and special values after removing the superfluous rows
wdi_2019 %>% num_na_special()
## Country Name Country Code Region Income Group Year SP.POP.TOTL
## [1,] "num_NA" "0" "0" "0" "0" "0" "1"
## [2,] "num_NaN" "0" "0" "0" "0" "0" "0"
## [3,] "num_Inf" "0" "0" "0" "0" "0" "0"
## SP.POP.TOTL.MA.IN SP.POP.TOTL.FE.IN Gender_ratio
## [1,] "24" "24" "24"
## [2,] "0" "0" "0"
## [3,] "0" "0" "0"
# It appears that the only country with missing SP.POP.TOTL is Eritrea
print_na(wdi_2019, "SP.POP.TOTL")
# Impute the missing SP.POP.TOTL
wdi_2019[is.na(wdi_2019$SP.POP.TOTL),"SP.POP.TOTL"]<- world_pop - sum(wdi_2019$SP.POP.TOTL,na.rm = TRUE)
# Inspect missing values in other numeric variables
print_na(wdi_2019,c("SP.POP.TOTL","SP.POP.TOTL.MA.IN","SP.POP.TOTL.FE.IN","Gender_ratio"))
# Impute the missing Gender_ratio with column mean
wdi_2019$Gender_ratio <- impute(wdi_2019$Gender_ratio,fun=mean)
# Compute missing female population from total population and Gender_ratio
for(x in 1:length(wdi_2019$SP.POP.TOTL.FE.IN)) {
if(is.na(wdi_2019$SP.POP.TOTL.FE.IN[x])) {
wdi_2019$SP.POP.TOTL.FE.IN[x] <-
round(wdi_2019$SP.POP.TOTL[x]/(1+wdi_2019$Gender_ratio[x]))
}
else {next}
}
# Compute missing male population from total population and female population
for(x in 1:length(wdi_2019$SP.POP.TOTL.MA.IN)){
if(is.na(wdi_2019$SP.POP.TOTL.MA.IN[x])) {
wdi_2019$SP.POP.TOTL.MA.IN[x] <-
round(wdi_2019$SP.POP.TOTL[x]-wdi_2019$SP.POP.TOTL.FE.IN[x])
}
else {next}
}
# Check all the missing values have been dealt with
wdi_2019 %>% num_na_special()
## Country Name Country Code Region Income Group Year SP.POP.TOTL
## [1,] "num_NA" "0" "0" "0" "0" "0" "0"
## [2,] "num_NaN" "0" "0" "0" "0" "0" "0"
## [3,] "num_Inf" "0" "0" "0" "0" "0" "0"
## SP.POP.TOTL.MA.IN SP.POP.TOTL.FE.IN Gender_ratio
## [1,] "0" "0" "0"
## [2,] "0" "0" "0"
## [3,] "0" "0" "0"
# Check obvious inconsistencies or errors
# Import rules
rules <- editfile("Rules.txt", type = "all")
# Check rules
rules
##
## Data model:
## dat1 : Region %in% c('East Asia & Pacific', 'Europe & Central Asia', 'Latin America & Caribbean', 'Middle East & North Africa', 'North America', 'South Asia', 'Sub-Saharan Africa')
##
## Edit set:
## num1 : 0 < SP.POP.TOTL
## num2 : 0 < SP.POP.TOTL.MA.IN
## num3 : 0 < SP.POP.TOTL.FE.IN
## num4 : SP.POP.TOTL == SP.POP.TOTL.MA.IN + SP.POP.TOTL.FE.IN
# Check violations
summary(violatedEdits(rules, wdi_2019))
## No violations detected, 0 checks evaluated to NA
## NULL
The numeric variables SP.POP.TOTL, SP.POP.TOTL.MA.IN, SP.POP.TOTL.FE.IN and Gender_ratio are scanned for outliers in their Income Gorup using bivariate box plot, and the outliers are printed for inspection.
Unsurprisingly, outliers in the populational variables are countries with larger population in their income groups, e.g. China, India, etc.
Gender_ratio shows a more interesting result. Most of the countries with a low male to female ratio in their income groups are in Eastern Europe, while most of the countries with a high ratio are in Middle East. This may indicate that gender ratio of a country is affected by its culture.
For every country has its own characteristics, no outlier is modified in this preprocessing.
# Boxplot and print outliers of each numeric variable
for( x in c("SP.POP.TOTL","SP.POP.TOTL.MA.IN","SP.POP.TOTL.FE.IN","Gender_ratio")){
boxplot <- boxplot(unlist(wdi_2019[,x])~wdi_2019$`Income Group`,
xlab ="Income Group", ylab = x)
print(wdi_2019 %>% select(`Country Name`,Region,`Income Group`,Year,!!as.name(x)) %>%
filter(!!as.name(x) %in% boxplot$out),n=length(boxplot$out))
}
## # A tibble: 28 x 5
## `Country Name` Region `Income Group` Year SP.POP.TOTL
## <chr> <fct> <ord> <date> <dbl>
## 1 Australia East Asia & Pacif~ High income 2019-01-01 25364307
## 2 Bangladesh South Asia Lower middle inc~ 2019-01-01 163046161
## 3 Brazil Latin America & C~ Upper middle inc~ 2019-01-01 211049527
## 4 Canada North America High income 2019-01-01 37589262
## 5 China East Asia & Pacif~ Upper middle inc~ 2019-01-01 1397715000
## 6 Congo, Dem. Rep. Sub-Saharan Africa Low income 2019-01-01 86790567
## 7 Egypt, Arab Rep. Middle East & Nor~ Lower middle inc~ 2019-01-01 100388073
## 8 Ethiopia Sub-Saharan Africa Low income 2019-01-01 112078730
## 9 France Europe & Central ~ High income 2019-01-01 67059887
## 10 Germany Europe & Central ~ High income 2019-01-01 83132799
## 11 India South Asia Lower middle inc~ 2019-01-01 1366417754
## 12 Indonesia East Asia & Pacif~ Upper middle inc~ 2019-01-01 270625568
## 13 Iran, Islamic Re~ Middle East & Nor~ Upper middle inc~ 2019-01-01 82913906
## 14 Italy Europe & Central ~ High income 2019-01-01 60297396
## 15 Japan East Asia & Pacif~ High income 2019-01-01 126264931
## 16 Korea, Rep. East Asia & Pacif~ High income 2019-01-01 51709098
## 17 Mexico Latin America & C~ Upper middle inc~ 2019-01-01 127575529
## 18 Nigeria Sub-Saharan Africa Lower middle inc~ 2019-01-01 200963599
## 19 Pakistan South Asia Lower middle inc~ 2019-01-01 216565318
## 20 Philippines East Asia & Pacif~ Lower middle inc~ 2019-01-01 108116615
## 21 Poland Europe & Central ~ High income 2019-01-01 37970874
## 22 Russian Federati~ Europe & Central ~ Upper middle inc~ 2019-01-01 144373535
## 23 Saudi Arabia Middle East & Nor~ High income 2019-01-01 34268528
## 24 Spain Europe & Central ~ High income 2019-01-01 47076781
## 25 Turkey Europe & Central ~ Upper middle inc~ 2019-01-01 83429615
## 26 United Kingdom Europe & Central ~ High income 2019-01-01 66834405
## 27 United States North America High income 2019-01-01 328239523
## 28 Vietnam East Asia & Pacif~ Lower middle inc~ 2019-01-01 96462106
## # A tibble: 28 x 5
## `Country Name` Region `Income Group` Year SP.POP.TOTL.MA.~
## <chr> <fct> <ord> <date> <dbl>
## 1 Australia East Asia & Pa~ High income 2019-01-01 12631543
## 2 Bangladesh South Asia Lower middle in~ 2019-01-01 82473785
## 3 Brazil Latin America ~ Upper middle in~ 2019-01-01 103733164
## 4 Canada North America High income 2019-01-01 18651974
## 5 China East Asia & Pa~ Upper middle in~ 2019-01-01 717118675
## 6 Congo, Dem. Rep. Sub-Saharan Af~ Low income 2019-01-01 43319103
## 7 Egypt, Arab Rep. Middle East & ~ Lower middle in~ 2019-01-01 50722597
## 8 Ethiopia Sub-Saharan Af~ Low income 2019-01-01 56069010
## 9 France Europe & Centr~ High income 2019-01-01 32458740
## 10 Germany Europe & Centr~ High income 2019-01-01 41059335
## 11 India South Asia Lower middle in~ 2019-01-01 710129571
## 12 Indonesia East Asia & Pa~ Upper middle in~ 2019-01-01 136269762
## 13 Iran, Islamic R~ Middle East & ~ Upper middle in~ 2019-01-01 41889897
## 14 Italy Europe & Centr~ High income 2019-01-01 29338084
## 15 Japan East Asia & Pa~ High income 2019-01-01 61659571
## 16 Korea, Rep. East Asia & Pa~ High income 2019-01-01 25891693
## 17 Mexico Latin America ~ Upper middle in~ 2019-01-01 62403393
## 18 Nigeria Sub-Saharan Af~ Lower middle in~ 2019-01-01 101831872
## 19 Pakistan South Asia Lower middle in~ 2019-01-01 111447582
## 20 Philippines East Asia & Pa~ Lower middle in~ 2019-01-01 54316068
## 21 Poland Europe & Centr~ High income 2019-01-01 18400811
## 22 Russian Federat~ Europe & Centr~ Upper middle in~ 2019-01-01 66908372
## 23 Saudi Arabia Middle East & ~ High income 2019-01-01 19783531
## 24 Spain Europe & Centr~ High income 2019-01-01 23127681
## 25 Turkey Europe & Centr~ Upper middle in~ 2019-01-01 41173713
## 26 United Kingdom Europe & Centr~ High income 2019-01-01 33007819
## 27 United States North America High income 2019-01-01 162417882
## 28 Vietnam East Asia & Pa~ Lower middle in~ 2019-01-01 48151351
## # A tibble: 28 x 5
## `Country Name` Region `Income Group` Year SP.POP.TOTL.FE.~
## <chr> <fct> <ord> <date> <dbl>
## 1 Australia East Asia & Pa~ High income 2019-01-01 12732764
## 2 Bangladesh South Asia Lower middle in~ 2019-01-01 80572376
## 3 Brazil Latin America ~ Upper middle in~ 2019-01-01 107316363
## 4 Canada North America High income 2019-01-01 18937288
## 5 China East Asia & Pa~ Upper middle in~ 2019-01-01 680596325
## 6 Congo, Dem. Rep. Sub-Saharan Af~ Low income 2019-01-01 43471464
## 7 Egypt, Arab Rep. Middle East & ~ Lower middle in~ 2019-01-01 49665476
## 8 Ethiopia Sub-Saharan Af~ Low income 2019-01-01 56009720
## 9 France Europe & Centr~ High income 2019-01-01 34601147
## 10 Germany Europe & Centr~ High income 2019-01-01 42073464
## 11 India South Asia Lower middle in~ 2019-01-01 656288183
## 12 Indonesia East Asia & Pa~ Upper middle in~ 2019-01-01 134355806
## 13 Iran, Islamic R~ Middle East & ~ Upper middle in~ 2019-01-01 41024009
## 14 Italy Europe & Centr~ High income 2019-01-01 30959312
## 15 Japan East Asia & Pa~ High income 2019-01-01 64605360
## 16 Korea, Rep. East Asia & Pa~ High income 2019-01-01 25817405
## 17 Mexico Latin America ~ Upper middle in~ 2019-01-01 65172136
## 18 Nigeria Sub-Saharan Af~ Lower middle in~ 2019-01-01 99131727
## 19 Pakistan South Asia Lower middle in~ 2019-01-01 105117736
## 20 Philippines East Asia & Pa~ Lower middle in~ 2019-01-01 53800547
## 21 Poland Europe & Centr~ High income 2019-01-01 19570063
## 22 Russian Federat~ Europe & Centr~ Upper middle in~ 2019-01-01 77465163
## 23 Saudi Arabia Middle East & ~ High income 2019-01-01 14484997
## 24 Spain Europe & Centr~ High income 2019-01-01 23949100
## 25 Turkey Europe & Centr~ Upper middle in~ 2019-01-01 42255902
## 26 United Kingdom Europe & Centr~ High income 2019-01-01 33826586
## 27 United States North America High income 2019-01-01 165821641
## 28 Vietnam East Asia & Pa~ Lower middle in~ 2019-01-01 48310755
## # A tibble: 16 x 5
## `Country Name` Region `Income Group` Year Gender_ratio
## <chr> <fct> <ord> <date> <dbl>
## 1 Afghanistan South Asia Low income 2019-01-01 1.05
## 2 Bahrain Middle East & No~ High income 2019-01-01 1.80
## 3 Belarus Europe & Central~ Upper middle in~ 2019-01-01 0.871
## 4 Bhutan South Asia Lower middle in~ 2019-01-01 1.13
## 5 Djibouti Middle East & No~ Lower middle in~ 2019-01-01 1.11
## 6 El Salvador Latin America & ~ Lower middle in~ 2019-01-01 0.881
## 7 Equatorial Guinea Sub-Saharan Afri~ Upper middle in~ 2019-01-01 1.25
## 8 Kuwait Middle East & No~ High income 2019-01-01 1.56
## 9 Maldives South Asia Upper middle in~ 2019-01-01 1.72
## 10 Nepal South Asia Lower middle in~ 2019-01-01 0.838
## 11 Oman Middle East & No~ High income 2019-01-01 1.94
## 12 Qatar Middle East & No~ High income 2019-01-01 3.05
## 13 Russian Federation Europe & Central~ Upper middle in~ 2019-01-01 0.864
## 14 Saudi Arabia Middle East & No~ High income 2019-01-01 1.37
## 15 Ukraine Europe & Central~ Lower middle in~ 2019-01-01 0.863
## 16 United Arab Emira~ Middle East & No~ High income 2019-01-01 2.25
Population of countries vary from tens of thousands to billions. Therefore, the populational variables are highly right skewed. For better understanding, log transformation with base 10 is applied on these variables. Moreover, Box-Cox transformation is applied on Gender_ratio to normalise it.
New variables are created at this step to preserve original data for further use.
# Check the skewness of polulation values.
for(x in c("SP.POP.TOTL","SP.POP.TOTL.MA.IN","SP.POP.TOTL.FE.IN")){
hist(wdi_2019[,x])
}
# Apply base 10 log transformation
wdi_2019 <- wdi_2019 %>% mutate(SP.POP.TOTL_log = log10(SP.POP.TOTL),
SP.POP.TOTL.MA.IN_log = log10(SP.POP.TOTL.MA.IN),
SP.POP.TOTL.FE.IN_log = log10(SP.POP.TOTL.FE.IN))
# Check the skewness after log transformation.
for(x in c("SP.POP.TOTL_log","SP.POP.TOTL.MA.IN_log","SP.POP.TOTL.FE.IN_log")){
hist(wdi_2019[,x])
}
# Check the distribution of Gender_ratio
hist(wdi_2019$"Gender_ratio")
# Normalise the Gender_ratio with Box-Cox transformation
wdi_2019$Gender_ratio_BC <- BoxCox(wdi_2019$Gender_ratio,lambda = "auto")
# Check the distribution after Box-Cox transformation
hist(wdi_2019$Gender_ratio_BC)