MATH2349 Data Wrangling

Required packages

# Output message of this part is hidden to reduce the length of the report
library(readxl)
library(stringr)
library(Hmisc)
library(editrules)
library(forecast)
library(lubridate)
library(dplyr)
library(tidyr)

Executive Summary

The dataset containing development indicators of countries in the years from 1960 to 2020 was drawn from World Bank Open Data.
First, the required libraries were loaded and the dataset was imported and subset for the purpose of preprocessing.
After understanding the variables thoroughly, categorical variables Region and Income Group were converted from characters to factors.
In the dataset, column names of 1960 to 2020 were values of the variable “Year”, while column Indicator Code contains variable names instead of observations. Therefore, this dataset was untidy and was tidied using the tidyr package. The Year variable produced in this process was converted form character to Date. And further subset was performed to focus on some populational indicators in 2019.
Also, a numeric variable Gender_ratio was created and computed in order to compare the male to female ratio in different countries.
Then the dataset was scanned for any missing or special values and obvious inconsistency. Some missing values were found in this scan and dealt with in different ways according to the nature of the variables.
A second scan was performed to check for outliers. The outliers were inspected, but due to the nature of the dataset, they were left untreated.
Finally, log transformations and a Box-Cox transformation was performed on numeric variables for better understanding and illustration of this dataset.

Data

The dataset is drawn from World Bank Open Data(https://datacatalog.worldbank.org/dataset/world-development-indicators). This dataset contains the primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.
This dataset is licensed under CC-BY 4.0, which allows users to copy, modify and distribute data in any format for any purpose.

Variable Descriptions

This dataset comes in the form of a “WDIEXCEL.xlsx”. Only two sheets “Country” and “Data” are used in the preprocessing.

“Country”

Although this sheet contains 30 variables, only some of them, as listed below, are relevant for the purposes of this preprocessing.
1.Country Code: Unique three-letter codes of a country.
2.Currency Unit: The unit of a country’s official currency.
3.Special Notes: Notable details of the rows.
4.Region: A categorical variable that indicates the geographic region group of a country.
5.Income Group: A categorical variable that indicates income level of a country, divided in 4 income groups.
Other variables are dropped after reading the file.

“Data”

This sheet contains following variables:
1.Country Name: Name of a country.
2.Country Code: Unique three-letter codes of a country, same as in “Country”.
3.Indicator Name: Name of World Development Indicators.
This variable will be dropped, because after tidying, the corresponding Indicator Code will be used as column names, making this variable superfluous.
4.Indicator Code: Short code of World Development Indicators.
1960-2020: Numeric variables that indicates the value of each indicator for each country in each year.

Note: There are rows for aggregate and indexing purpose in both sheets. These rows will be removed in “Scan I” part.

# Read the file
country <- read_excel("WDIEXCEL.xlsx",sheet ="Country")
data <- read_excel("WDIEXCEL.xlsx",sheet ="Data")

# Check the imported datasets
head(country)

head(data)

# Subset the datasets to remove unneeded variables
country <- country %>% select(`Country Code`, `Short Name`,
                              `Currency Unit`,`Special Notes`,Region,`Income Group`)
data <- data %>% select(-`Indicator Name`)

# Check the datasets after subsetting
head(country)

head(data)

# Merge the datasets with the common key `Country Code` 
# Only two columns in "country" will be joined to "wdi", the purpose of the other columns will be explained in Scan I
wdi <- data %>% left_join(select(country,`Country Code`, Region,`Income Group`),
                          by="Country Code")
# Check the resulted dataset to be processed further
head(wdi)

Understand

The resulted wdi dataset contains 66 variables.
Among them, Country Name and Country Code are character variables with values unique to each country.
Indicator Code are character variables with values unique to each World Development Indicator.
1960 to 2020 are numeric variables containing the values of each World Development Indicator of each country in each year.
Region and Income Group are character variables with values indicating region and income groups, which should be converted to factors.

# Check the structure of wdi, `1960` to `2019`is omitted to reduce the length of the report
str(wdi %>% select(-(`1960`:`2019`)))

## tibble [379,368 x 6] (S3: tbl_df/tbl/data.frame)
##  $ Country Name  : chr [1:379368] "Arab World" "Arab World" "Arab World" "Arab World" ...
##  $ Country Code  : chr [1:379368] "ARB" "ARB" "ARB" "ARB" ...
##  $ Indicator Code: chr [1:379368] "EG.CFT.ACCS.ZS" "EG.ELC.ACCS.ZS" "EG.ELC.ACCS.RU.ZS" "EG.ELC.ACCS.UR.ZS" ...
##  $ 2020          : num [1:379368] NA NA NA NA NA NA NA NA NA NA ...
##  $ Region        : chr [1:379368] NA NA NA NA ...
##  $ Income Group  : chr [1:379368] NA NA NA NA ...

# Convert `Region` to factor, `Income Group` to ordered factor
wdi <- wdi %>% mutate( Region=factor(Region),
                       `Income Group`=factor(`Income Group`,
                                             levels = c("Low income","Lower middle income",
                                                        "Upper middle income","High income"),
                                             ordered=TRUE))

# Check the structure of wdi after conversion, `1960` to `2019`is omited to reduce the length of the report
str(wdi %>% select(-(`1960`:`2019`)))

## tibble [379,368 x 6] (S3: tbl_df/tbl/data.frame)
##  $ Country Name  : chr [1:379368] "Arab World" "Arab World" "Arab World" "Arab World" ...
##  $ Country Code  : chr [1:379368] "ARB" "ARB" "ARB" "ARB" ...
##  $ Indicator Code: chr [1:379368] "EG.CFT.ACCS.ZS" "EG.ELC.ACCS.ZS" "EG.ELC.ACCS.RU.ZS" "EG.ELC.ACCS.UR.ZS" ...
##  $ 2020          : num [1:379368] NA NA NA NA NA NA NA NA NA NA ...
##  $ Region        : Factor w/ 7 levels "East Asia & Pacific",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Income Group  : Ord.factor w/ 4 levels "Low income"<"Lower middle income"<..: NA NA NA NA NA NA NA NA NA NA ...

Tidy & Manipulate Data I

In wdi dataset, column names of 1960 to 2020 are values of the variable “Year”, while column Indicator Code contains variable names instead of observations. Therefore, the tidy data principles “each observation must have its own row” and “each variable must have its own column” are violated. In other words, wdi is untidy.
To tidy this dataset, 1960 to 2020 are gathered intoYear and then Indicator Code is spread in order to form a tidy dataset. The Year variable is then converted form character to date format, and further subset is performed to focus on indicators SP.POP.TOTL( total population of a country ), SP.POP.TOTL.MA.IN( male population of a country ), SP.POP.TOTL.FE.IN( female population of a country ) in the year 2019.

# Gather() and spread() are used to tidy wdi 
wdi <- wdi %>% gather(`1960`:`2020`,key="Year",value = "value")%>%
  spread(key=`Indicator Code`,value = value)

# Convert characteristics `Year` to Date
wdi$Year <- as.Date(paste(wdi$Year,"01","01"), format="%Y%m%d")

# Further subset wdi in 2019
wdi_2019 <- wdi %>% select(`Country Name`,`Country Code`,Region,`Income Group`,
                           Year,`SP.POP.TOTL`,`SP.POP.TOTL.MA.IN`,`SP.POP.TOTL.FE.IN`) %>%
  filter(year(Year)==2019) 

# Check wdi_2019
head(wdi_2019)

Tidy & Manipulate Data II

In order to compare the male to female ratio in different countries, a numeric variable Gender_ratio is created and computed from male and female population.

# Create and compute `Gender_ratio` column 
wdi_2019 <- wdi_2019 %>% mutate(Gender_ratio=SP.POP.TOTL.MA.IN/SP.POP.TOTL.FE.IN)

# Check wdi_2019 after tidy and manipulation
head(wdi_2019)

Scan I

In order to scan the wdi_2019 dataset for missing and special values ( i.e. NaN, -Inf and Inf ), some functions are created, and then applied to the wdi_2019 dataset.
From the output of the functions, it appears that there are missing values in Region, Income Group and the numeric variables, while no special value presents in the dataset.
Further inspection shows that the missing values in Region, Income Group are caused by the aggregate and indexing rows that are not actual observation which should be excluded anyway.
After excluding the aggregate and indexing rows, only the country Eritrea has missing value in the SP.POP.TOTL( total population of a country ) variable. Thus, this value is simply computed from the difference of total world population and the sum population of other countries in 2019.
Then the missing Gender_ratio values are replaced with the mean Gender_ratio of all the other countries using the Hmisc package. And the missing male and female population are computed accordingly.
Finally, editrules package is used to make sure there is no obvious inconsistency in the dataset. Although due to editrules’s failure to support variables with space, “/” operator or even the multiplication between two variables, only simple rules are applied.

# Function for checking missing and special values.
check_na_special <- function(x){
  a <- sum(is.na(x))
  b <- sum(is.nan(x))
  c <- sum(is.infinite(x))
  return(c(a,b,c))
}

# Function for displaying the number and proportion of missing and special values.
num_na_special <- function(x){
  print(cbind(c("num_NA","num_NaN","num_Inf"),
              sapply(x,check_na_special)))
}

pro_na_special <- function(x){
  print(cbind(c("pro_NA","pro_NaN","pro_Inf"),
              sapply(x,function(x) check_na_special(x)/length(x))))
}

# Function for printing rows with missing values in specific columns
print_na <- function(df,cols){
  vec <- rep(FALSE,times=nrow(df))
  for (col in cols) {
   for (row in 1:nrow(df)) {
     if(!vec[row]){
       if(is.na(df[row,col])) {
         vec[row] <- TRUE
       }
     }
   }
  }
  head(df[vec,],n=sum(vec))
}

# Check missing and special values in wdi_2019
wdi_2019 %>% num_na_special()

##                Country Name Country Code Region Income Group Year SP.POP.TOTL
## [1,] "num_NA"  "0"          "0"          "47"   "47"         "0"  "2"        
## [2,] "num_NaN" "0"          "0"          "0"    "0"          "0"  "0"        
## [3,] "num_Inf" "0"          "0"          "0"    "0"          "0"  "0"        
##      SP.POP.TOTL.MA.IN SP.POP.TOTL.FE.IN Gender_ratio
## [1,] "25"              "25"              "25"        
## [2,] "0"               "0"               "0"         
## [3,] "0"               "0"               "0"

wdi_2019 %>% pro_na_special()

##                Country Name Country Code Region             
## [1,] "pro_NA"  "0"          "0"          "0.178030303030303"
## [2,] "pro_NaN" "0"          "0"          "0"                
## [3,] "pro_Inf" "0"          "0"          "0"                
##      Income Group        Year SP.POP.TOTL           SP.POP.TOTL.MA.IN   
## [1,] "0.178030303030303" "0"  "0.00757575757575758" "0.0946969696969697"
## [2,] "0"                 "0"  "0"                   "0"                 
## [3,] "0"                 "0"  "0"                   "0"                 
##      SP.POP.TOTL.FE.IN    Gender_ratio        
## [1,] "0.0946969696969697" "0.0946969696969697"
## [2,] "0"                  "0"                 
## [3,] "0"                  "0"

# Inspect missing values in Region and `Income Group`
print_na(wdi_2019,c("Region","Income Group"))

# Record the World total population for later computation
world_pop <- wdi_2019$SP.POP.TOTL[wdi_2019$`Country Name`=="World"]

# In order to identify the aggregate rows,the country dataset is scanned
# The aggregate rows are those with the word "aggregate" in `Special Notes` or do not have a `Currency Unit`
# Their `Country Code` and `Short Name` are selected to create a lookup table
aggregate_table <- country %>% filter(str_detect(country$`Special Notes`,pattern="aggregate")
                                      |is.na(country$`Currency Unit`)) %>% 
            select(`Country Code`,`Short Name`) %>% unique()

# Check the lookup table
print(aggregate_table)

## # A tibble: 46 x 2
##    `Country Code` `Short Name`                                 
##    <chr>          <chr>                                        
##  1 ARB            Arab World                                   
##  2 CEB            Central Europe and the Baltics               
##  3 CSS            Caribbean small states                       
##  4 EAP            East Asia & Pacific (excluding high income)  
##  5 EAR            Early-demographic dividend                   
##  6 EAS            East Asia & Pacific                          
##  7 ECA            Europe & Central Asia (excluding high income)
##  8 ECS            Europe & Central Asia                        
##  9 EMU            Euro area                                    
## 10 EUU            European Union                               
## # ... with 36 more rows

# Exclude rows with `Country Code` in the lookup table or with`Country Code` == "INX" which indicates the indexing rows
wdi_2019 <- wdi_2019 %>% filter(!(`Country Code` %in% aggregate_table$`Country Code`|`Country Code` == "INX") )

# Check the number of missing and special values after removing the superfluous rows
wdi_2019 %>% num_na_special()

##                Country Name Country Code Region Income Group Year SP.POP.TOTL
## [1,] "num_NA"  "0"          "0"          "0"    "0"          "0"  "1"        
## [2,] "num_NaN" "0"          "0"          "0"    "0"          "0"  "0"        
## [3,] "num_Inf" "0"          "0"          "0"    "0"          "0"  "0"        
##      SP.POP.TOTL.MA.IN SP.POP.TOTL.FE.IN Gender_ratio
## [1,] "24"              "24"              "24"        
## [2,] "0"               "0"               "0"         
## [3,] "0"               "0"               "0"

# It appears that the only country with missing SP.POP.TOTL is Eritrea
print_na(wdi_2019, "SP.POP.TOTL")

# Impute the missing SP.POP.TOTL 
wdi_2019[is.na(wdi_2019$SP.POP.TOTL),"SP.POP.TOTL"]<- world_pop - sum(wdi_2019$SP.POP.TOTL,na.rm = TRUE)

# Inspect missing values in other numeric variables
print_na(wdi_2019,c("SP.POP.TOTL","SP.POP.TOTL.MA.IN","SP.POP.TOTL.FE.IN","Gender_ratio"))

# Impute the missing Gender_ratio with column mean 
wdi_2019$Gender_ratio <- impute(wdi_2019$Gender_ratio,fun=mean)

# Compute missing female population from total population and Gender_ratio
for(x in 1:length(wdi_2019$SP.POP.TOTL.FE.IN)) {
    if(is.na(wdi_2019$SP.POP.TOTL.FE.IN[x])) {
      wdi_2019$SP.POP.TOTL.FE.IN[x] <-
        round(wdi_2019$SP.POP.TOTL[x]/(1+wdi_2019$Gender_ratio[x]))
    }
    else {next}
}

# Compute missing male population from total population and female population
for(x in 1:length(wdi_2019$SP.POP.TOTL.MA.IN)){
    if(is.na(wdi_2019$SP.POP.TOTL.MA.IN[x])) {
      wdi_2019$SP.POP.TOTL.MA.IN[x] <-
        round(wdi_2019$SP.POP.TOTL[x]-wdi_2019$SP.POP.TOTL.FE.IN[x])
    }
    else {next}
}

# Check all the missing values have been dealt with
wdi_2019 %>% num_na_special()

##                Country Name Country Code Region Income Group Year SP.POP.TOTL
## [1,] "num_NA"  "0"          "0"          "0"    "0"          "0"  "0"        
## [2,] "num_NaN" "0"          "0"          "0"    "0"          "0"  "0"        
## [3,] "num_Inf" "0"          "0"          "0"    "0"          "0"  "0"        
##      SP.POP.TOTL.MA.IN SP.POP.TOTL.FE.IN Gender_ratio
## [1,] "0"               "0"               "0"         
## [2,] "0"               "0"               "0"         
## [3,] "0"               "0"               "0"

# Check obvious inconsistencies or errors
# Import rules
rules <- editfile("Rules.txt", type = "all")

# Check rules
rules

## 
## Data model:
## dat1 : Region %in% c('East Asia & Pacific', 'Europe & Central Asia', 'Latin America & Caribbean', 'Middle East & North Africa', 'North America', 'South Asia', 'Sub-Saharan Africa') 
## 
## Edit set:
## num1 : 0 < SP.POP.TOTL
## num2 : 0 < SP.POP.TOTL.MA.IN
## num3 : 0 < SP.POP.TOTL.FE.IN
## num4 : SP.POP.TOTL == SP.POP.TOTL.MA.IN + SP.POP.TOTL.FE.IN

# Check violations
summary(violatedEdits(rules, wdi_2019))

## No violations detected, 0 checks evaluated to NA

## NULL

Scan II

The numeric variables SP.POP.TOTL, SP.POP.TOTL.MA.IN, SP.POP.TOTL.FE.IN and Gender_ratio are scanned for outliers in their Income Gorup using bivariate box plot, and the outliers are printed for inspection.
Unsurprisingly, outliers in the populational variables are countries with larger population in their income groups, e.g. China, India, etc.
Gender_ratio shows a more interesting result. Most of the countries with a low male to female ratio in their income groups are in Eastern Europe, while most of the countries with a high ratio are in Middle East. This may indicate that gender ratio of a country is affected by its culture.
For every country has its own characteristics, no outlier is modified in this preprocessing.

# Boxplot and print outliers of each numeric variable
for( x in c("SP.POP.TOTL","SP.POP.TOTL.MA.IN","SP.POP.TOTL.FE.IN","Gender_ratio")){
  boxplot <-  boxplot(unlist(wdi_2019[,x])~wdi_2019$`Income Group`,
                      xlab ="Income Group", ylab = x)
  print(wdi_2019 %>% select(`Country Name`,Region,`Income Group`,Year,!!as.name(x)) %>%
          filter(!!as.name(x) %in% boxplot$out),n=length(boxplot$out))
}

## # A tibble: 28 x 5
##    `Country Name`    Region             `Income Group`    Year       SP.POP.TOTL
##    <chr>             <fct>              <ord>             <date>           <dbl>
##  1 Australia         East Asia & Pacif~ High income       2019-01-01    25364307
##  2 Bangladesh        South Asia         Lower middle inc~ 2019-01-01   163046161
##  3 Brazil            Latin America & C~ Upper middle inc~ 2019-01-01   211049527
##  4 Canada            North America      High income       2019-01-01    37589262
##  5 China             East Asia & Pacif~ Upper middle inc~ 2019-01-01  1397715000
##  6 Congo, Dem. Rep.  Sub-Saharan Africa Low income        2019-01-01    86790567
##  7 Egypt, Arab Rep.  Middle East & Nor~ Lower middle inc~ 2019-01-01   100388073
##  8 Ethiopia          Sub-Saharan Africa Low income        2019-01-01   112078730
##  9 France            Europe & Central ~ High income       2019-01-01    67059887
## 10 Germany           Europe & Central ~ High income       2019-01-01    83132799
## 11 India             South Asia         Lower middle inc~ 2019-01-01  1366417754
## 12 Indonesia         East Asia & Pacif~ Upper middle inc~ 2019-01-01   270625568
## 13 Iran, Islamic Re~ Middle East & Nor~ Upper middle inc~ 2019-01-01    82913906
## 14 Italy             Europe & Central ~ High income       2019-01-01    60297396
## 15 Japan             East Asia & Pacif~ High income       2019-01-01   126264931
## 16 Korea, Rep.       East Asia & Pacif~ High income       2019-01-01    51709098
## 17 Mexico            Latin America & C~ Upper middle inc~ 2019-01-01   127575529
## 18 Nigeria           Sub-Saharan Africa Lower middle inc~ 2019-01-01   200963599
## 19 Pakistan          South Asia         Lower middle inc~ 2019-01-01   216565318
## 20 Philippines       East Asia & Pacif~ Lower middle inc~ 2019-01-01   108116615
## 21 Poland            Europe & Central ~ High income       2019-01-01    37970874
## 22 Russian Federati~ Europe & Central ~ Upper middle inc~ 2019-01-01   144373535
## 23 Saudi Arabia      Middle East & Nor~ High income       2019-01-01    34268528
## 24 Spain             Europe & Central ~ High income       2019-01-01    47076781
## 25 Turkey            Europe & Central ~ Upper middle inc~ 2019-01-01    83429615
## 26 United Kingdom    Europe & Central ~ High income       2019-01-01    66834405
## 27 United States     North America      High income       2019-01-01   328239523
## 28 Vietnam           East Asia & Pacif~ Lower middle inc~ 2019-01-01    96462106

## # A tibble: 28 x 5
##    `Country Name`   Region          `Income Group`   Year       SP.POP.TOTL.MA.~
##    <chr>            <fct>           <ord>            <date>                <dbl>
##  1 Australia        East Asia & Pa~ High income      2019-01-01         12631543
##  2 Bangladesh       South Asia      Lower middle in~ 2019-01-01         82473785
##  3 Brazil           Latin America ~ Upper middle in~ 2019-01-01        103733164
##  4 Canada           North America   High income      2019-01-01         18651974
##  5 China            East Asia & Pa~ Upper middle in~ 2019-01-01        717118675
##  6 Congo, Dem. Rep. Sub-Saharan Af~ Low income       2019-01-01         43319103
##  7 Egypt, Arab Rep. Middle East & ~ Lower middle in~ 2019-01-01         50722597
##  8 Ethiopia         Sub-Saharan Af~ Low income       2019-01-01         56069010
##  9 France           Europe & Centr~ High income      2019-01-01         32458740
## 10 Germany          Europe & Centr~ High income      2019-01-01         41059335
## 11 India            South Asia      Lower middle in~ 2019-01-01        710129571
## 12 Indonesia        East Asia & Pa~ Upper middle in~ 2019-01-01        136269762
## 13 Iran, Islamic R~ Middle East & ~ Upper middle in~ 2019-01-01         41889897
## 14 Italy            Europe & Centr~ High income      2019-01-01         29338084
## 15 Japan            East Asia & Pa~ High income      2019-01-01         61659571
## 16 Korea, Rep.      East Asia & Pa~ High income      2019-01-01         25891693
## 17 Mexico           Latin America ~ Upper middle in~ 2019-01-01         62403393
## 18 Nigeria          Sub-Saharan Af~ Lower middle in~ 2019-01-01        101831872
## 19 Pakistan         South Asia      Lower middle in~ 2019-01-01        111447582
## 20 Philippines      East Asia & Pa~ Lower middle in~ 2019-01-01         54316068
## 21 Poland           Europe & Centr~ High income      2019-01-01         18400811
## 22 Russian Federat~ Europe & Centr~ Upper middle in~ 2019-01-01         66908372
## 23 Saudi Arabia     Middle East & ~ High income      2019-01-01         19783531
## 24 Spain            Europe & Centr~ High income      2019-01-01         23127681
## 25 Turkey           Europe & Centr~ Upper middle in~ 2019-01-01         41173713
## 26 United Kingdom   Europe & Centr~ High income      2019-01-01         33007819
## 27 United States    North America   High income      2019-01-01        162417882
## 28 Vietnam          East Asia & Pa~ Lower middle in~ 2019-01-01         48151351

## # A tibble: 28 x 5
##    `Country Name`   Region          `Income Group`   Year       SP.POP.TOTL.FE.~
##    <chr>            <fct>           <ord>            <date>                <dbl>
##  1 Australia        East Asia & Pa~ High income      2019-01-01         12732764
##  2 Bangladesh       South Asia      Lower middle in~ 2019-01-01         80572376
##  3 Brazil           Latin America ~ Upper middle in~ 2019-01-01        107316363
##  4 Canada           North America   High income      2019-01-01         18937288
##  5 China            East Asia & Pa~ Upper middle in~ 2019-01-01        680596325
##  6 Congo, Dem. Rep. Sub-Saharan Af~ Low income       2019-01-01         43471464
##  7 Egypt, Arab Rep. Middle East & ~ Lower middle in~ 2019-01-01         49665476
##  8 Ethiopia         Sub-Saharan Af~ Low income       2019-01-01         56009720
##  9 France           Europe & Centr~ High income      2019-01-01         34601147
## 10 Germany          Europe & Centr~ High income      2019-01-01         42073464
## 11 India            South Asia      Lower middle in~ 2019-01-01        656288183
## 12 Indonesia        East Asia & Pa~ Upper middle in~ 2019-01-01        134355806
## 13 Iran, Islamic R~ Middle East & ~ Upper middle in~ 2019-01-01         41024009
## 14 Italy            Europe & Centr~ High income      2019-01-01         30959312
## 15 Japan            East Asia & Pa~ High income      2019-01-01         64605360
## 16 Korea, Rep.      East Asia & Pa~ High income      2019-01-01         25817405
## 17 Mexico           Latin America ~ Upper middle in~ 2019-01-01         65172136
## 18 Nigeria          Sub-Saharan Af~ Lower middle in~ 2019-01-01         99131727
## 19 Pakistan         South Asia      Lower middle in~ 2019-01-01        105117736
## 20 Philippines      East Asia & Pa~ Lower middle in~ 2019-01-01         53800547
## 21 Poland           Europe & Centr~ High income      2019-01-01         19570063
## 22 Russian Federat~ Europe & Centr~ Upper middle in~ 2019-01-01         77465163
## 23 Saudi Arabia     Middle East & ~ High income      2019-01-01         14484997
## 24 Spain            Europe & Centr~ High income      2019-01-01         23949100
## 25 Turkey           Europe & Centr~ Upper middle in~ 2019-01-01         42255902
## 26 United Kingdom   Europe & Centr~ High income      2019-01-01         33826586
## 27 United States    North America   High income      2019-01-01        165821641
## 28 Vietnam          East Asia & Pa~ Lower middle in~ 2019-01-01         48310755

## # A tibble: 16 x 5
##    `Country Name`     Region            `Income Group`   Year       Gender_ratio
##    <chr>              <fct>             <ord>            <date>            <dbl>
##  1 Afghanistan        South Asia        Low income       2019-01-01        1.05 
##  2 Bahrain            Middle East & No~ High income      2019-01-01        1.80 
##  3 Belarus            Europe & Central~ Upper middle in~ 2019-01-01        0.871
##  4 Bhutan             South Asia        Lower middle in~ 2019-01-01        1.13 
##  5 Djibouti           Middle East & No~ Lower middle in~ 2019-01-01        1.11 
##  6 El Salvador        Latin America & ~ Lower middle in~ 2019-01-01        0.881
##  7 Equatorial Guinea  Sub-Saharan Afri~ Upper middle in~ 2019-01-01        1.25 
##  8 Kuwait             Middle East & No~ High income      2019-01-01        1.56 
##  9 Maldives           South Asia        Upper middle in~ 2019-01-01        1.72 
## 10 Nepal              South Asia        Lower middle in~ 2019-01-01        0.838
## 11 Oman               Middle East & No~ High income      2019-01-01        1.94 
## 12 Qatar              Middle East & No~ High income      2019-01-01        3.05 
## 13 Russian Federation Europe & Central~ Upper middle in~ 2019-01-01        0.864
## 14 Saudi Arabia       Middle East & No~ High income      2019-01-01        1.37 
## 15 Ukraine            Europe & Central~ Lower middle in~ 2019-01-01        0.863
## 16 United Arab Emira~ Middle East & No~ High income      2019-01-01        2.25

Transform

Population of countries vary from tens of thousands to billions. Therefore, the populational variables are highly right skewed. For better understanding, log transformation with base 10 is applied on these variables. Moreover, Box-Cox transformation is applied on Gender_ratio to normalise it.
New variables are created at this step to preserve original data for further use.

# Check the skewness of polulation values.
for(x in c("SP.POP.TOTL","SP.POP.TOTL.MA.IN","SP.POP.TOTL.FE.IN")){
     hist(wdi_2019[,x])
}

# Apply base 10 log transformation
wdi_2019 <- wdi_2019 %>% mutate(SP.POP.TOTL_log = log10(SP.POP.TOTL),
                                SP.POP.TOTL.MA.IN_log = log10(SP.POP.TOTL.MA.IN),
                                SP.POP.TOTL.FE.IN_log = log10(SP.POP.TOTL.FE.IN))

# Check the skewness after log transformation.
for(x in c("SP.POP.TOTL_log","SP.POP.TOTL.MA.IN_log","SP.POP.TOTL.FE.IN_log")){
     hist(wdi_2019[,x])
}

# Check the distribution of Gender_ratio
hist(wdi_2019$"Gender_ratio")

# Normalise the Gender_ratio with Box-Cox transformation
wdi_2019$Gender_ratio_BC <- BoxCox(wdi_2019$Gender_ratio,lambda = "auto")

# Check the distribution after Box-Cox transformation
hist(wdi_2019$Gender_ratio_BC)