Required packages

Following are the required packages:

library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(outliers)
library(ggplot2)

Executive Summary

Three data sets were imported for this report from the world bank open data website. The data sets contain information about population, employment ratio, and income group of countries of the world. These three data sets contain data of 264 countries from the year 1960 to 2019. The initial preprocessing step was to filter data of relevant timeframe i.e. of 10 years from 2009-2018 and other columns that contain meaningful and related information, and countries that have a maximum of 3 NA values because if more values of NA were allowed than the data will become more centric to the value for which these NA values will be imputed across the selected year column.

The data sets were untidy and were restructured from wide to long format using gather function. The data sets were then merged using the appropriate join function as per relevancy and data requirement. The datatypes of the merged data set were then converted to desired datatypes. The NA values were scanned and imputed with the mean of that particular country during the selected timeframe. Then special values were checked of all numeric variables. A new variable was then created based on the existing variables in the data set. Outliers were scanned and visualized using boxplot and were detected using z-scores. The outliers were not removed as it consists of facts which in turn will alter the data which would not have been suitable. The last step included transforming the ‘population’ variable to decrease the skewness and convert the distribution into normal distribution.

Please note that the order of tasks has been altered, and changes in steps has been made as per requirement of data sets.

Data

The following three datasets were taken from the online WorldBank data repository.

Variable description:

The first data set shows the employment ratio of population for the countries all over the world for the years 1960-2019.

The data set contains 264 observations and 64 variables.

The second dataset consists of population of countries all over the world for the years 1960-2019.

The data set contains 264 observations and 64 variables.

The third data set contains the income group of countries along with other related information.

The data set contains 264 observations and 5 variables.

The employment ratio data set is downloaded from https://data.worldbank.org/indicator/SL.EMP.TOTL.SP.ZS

The population data set from: https://data.worldbank.org/indicator/SP.POP.TOTL?locations=1W

The third data set is the meta data which comes along both of the above data sets when they are downloaded.

The three datasets were read and saved as employment, population, and income respectively.

employment<-read_csv("API_SL.EMP.TOTL.SP.ZS_DS2_en_csv_v2_1070754.csv", skip=4)
## Warning: Missing column names filled in: 'X65' [65]
## Parsed with column specification:
## cols(
##   .default = col_logical(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   `1991` = col_double(),
##   `1992` = col_double(),
##   `1993` = col_double(),
##   `1994` = col_double(),
##   `1995` = col_double(),
##   `1996` = col_double(),
##   `1997` = col_double(),
##   `1998` = col_double(),
##   `1999` = col_double(),
##   `2000` = col_double(),
##   `2001` = col_double(),
##   `2002` = col_double(),
##   `2003` = col_double(),
##   `2004` = col_double(),
##   `2005` = col_double(),
##   `2006` = col_double()
##   # ... with 13 more columns
## )
## See spec(...) for full column specifications.
population<-read_csv("API_SP.POP.TOTL_DS2_en_csv_v2_1068829.csv", skip=4)
## Warning: Missing column names filled in: 'X65' [65]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   `2019` = col_logical(),
##   X65 = col_logical()
## )
## See spec(...) for full column specifications.
income<-read_csv("Metadata_Country_API_SP.POP.TOTL_DS2_en_csv_v2_1068829.csv")
## Warning: Missing column names filled in: 'X6' [6]
## Parsed with column specification:
## cols(
##   `Country Code` = col_character(),
##   Region = col_character(),
##   IncomeGroup = col_character(),
##   SpecialNotes = col_character(),
##   TableName = col_character(),
##   X6 = col_logical()
## )

Filtering the data sets to get targeted columns and removing unwanted columns that are not of interest. The year 2019 has no values in the employment data set. Therefore, not including it.

For employment data set the columns of interest are: Country Name, Country Code, 2009-2018 years.

employment <- employment[, c(1,2,54:63)]
head(employment)

For Population data set the columns of interest are: Country Name, Country Code, 2009-2018 years.

population <- population[,c(1,2,54:63)]
head(population)

For income data set the columns of interest are: Country Code, IncomeGroup and TableName.

Changing Column ‘TableName’ name’s to ‘Country Code’ for ease in merging the data sets.

income<-income[, c(1,3,5)]

#converting TableName column to Country Name
colnames(income)[3] <- "Country Name"

head(income)

Tidy & Manipulate Data I

For the data set: ‘employment’, we can see that there are certain countries that do not have any entries for any of the years, therefore, it contains all NA values. Therefore, filtering out the countries that have NA values greater then 3 as we are considering 10 years of data for each country and if there are more NA values, then imputing the missing values with the mean won’t give better results as it will make the data more centric towards the mean. Hence, filtering out the countries with more than 3 NA values for the employment data set. Later we checked for NA values after merging all data sets into one merged data.

position <- which(rowSums(is.na(employment)) <= 3)
employment<- employment[position,]

The data sets are untidy as they do not follow the tidy data principles.

The data sets employment and population are untidy as they are in wide format and they also contain NA values.

Converting the data sets to long format using gather() function.

employment <- gather(employment, `2009`:`2018`, key = 'Year', value = 'employment ratio')
head(employment)
population <- gather(population, `2009`:`2018`, key = 'Year', value = 'population')
head(population)

The data set ‘Income’ consists of missing values in the IncomeGroup variable which is a categorical variable that makes this data set untidy.

This is a valuable indicator as it represents income group. Imputing with mode will vary the particular country’s income group. Therefore, treating the missing values as a new category “No Response”. So, imputing the missing value with “No Response”

income$IncomeGroup <- impute(income$IncomeGroup, fun= "No Response")
head(income)

Tidy & Manipulate Data II

As we have checked for NA values previously and deleted the rows with more than 3 NA values for employment data set, here we merged employment and population data sets using the left join as employment data has 233 countries and, population data has 264 countries.

And then, merged the merged data set with Income data set using inner join.

merge1<-left_join(employment,population  )
## Joining, by = c("Country Name", "Country Code", "Year")
merge_final<-inner_join(merge1, income) 
## Joining, by = c("Country Name", "Country Code")
head(merge_final)

Understand

Here we checked the class, dimension and structure of our merged data set.

class(merge_final)
## [1] "tbl_df"     "tbl"        "data.frame"
dim(merge_final)
## [1] 2250    6
str(merge_final)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2250 obs. of  6 variables:
##  $ Country Name    : chr  "Afghanistan" "Angola" "Albania" "Arab World" ...
##  $ Country Code    : chr  "AFG" "AGO" "ALB" "ARB" ...
##  $ Year            : chr  "2009" "2009" "2009" "2009" ...
##  $ employment ratio: num  42.1 74.3 47.1 43.8 80.8 ...
##  $ population      : num  2.84e+07 2.25e+07 2.93e+06 3.47e+08 7.92e+06 ...
##  $ IncomeGroup     : 'impute' chr  "Low income" "Lower middle income" "Upper middle income" "No Response" ...
##   ..- attr(*, "imputed")= int  4 30 42 51 52 53 54 55 58 63 ...

It is noted that the Country Name, IncomeGroup and year are of character datatype which needs to be converted to factor datatype.

The rest of the variable’s data type is as they are required.

merge_final$`Country Name`<-factor(merge_final$`Country Name`)
merge_final$IncomeGroup <- factor(merge_final$IncomeGroup, 
                                  levels = c("Low income","Lower middle income","Upper middle income","High income","No Response"), 
                                  ordered = TRUE)
merge_final$Year <- factor(merge_final$Year, 
                             levels = c(2009,2010,2011,2012,2013,2014,2015,2016,2017,2018),ordered = TRUE)

Structure after variables datatype conversion.

str(merge_final)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2250 obs. of  6 variables:
##  $ Country Name    : Factor w/ 225 levels "Afghanistan",..: 1 4 2 5 211 6 7 8 9 10 ...
##  $ Country Code    : chr  "AFG" "AGO" "ALB" "ARB" ...
##  $ Year            : Ord.factor w/ 10 levels "2009"<"2010"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ employment ratio: num  42.1 74.3 47.1 43.8 80.8 ...
##  $ population      : num  2.84e+07 2.25e+07 2.93e+06 3.47e+08 7.92e+06 ...
##  $ IncomeGroup     : Ord.factor w/ 5 levels "Low income"<"Lower middle income"<..: 1 2 3 5 4 3 3 4 4 3 ...

Merged dataset after datatype conversion.

head(merge_final)

Scan I

Firstly, we find NA values in all the variables of our merged data set

colSums(is.na(merge_final))
##     Country Name     Country Code             Year employment ratio 
##                0                0                0                0 
##       population      IncomeGroup 
##                7                0

It can be noticed that population variable consist of 7 missing values. Checking the missing values for population by groupig by Income Group.

merge_final %>% group_by(IncomeGroup) %>% 
summarise(Mean = mean(population, na.rm = TRUE),
Count = n(),
`Missing values` = sum(is.na(population)))

Checking the missing values for population by grouping by Year.

merge_final %>% group_by(Year) %>% 
summarise(Mean = mean(population, na.rm = TRUE),
Count = n(),
`Missing values` = sum(is.na(population)))

Now,It can be seen that the population variable consists of 7 NA values. Therefore, imputing these values with the mean of that country of the values from the years 2009-2018 using impute() function.

merge_final<- group_by(merge_final, `Country Name`) %>% mutate(population = impute(population, fun = mean)) %>% ungroup()
## Warning in mutate_impl(.data, dots, caller_env()): Vectorizing 'impute' elements
## may not preserve their attributes
colSums(is.na(merge_final))
##     Country Name     Country Code             Year employment ratio 
##                0                0                0                0 
##       population      IncomeGroup 
##                0                0

All NA values of the population variable imputed and no other NA values present in the data set.

Checking Special values.

Creating a function is.special to check every numerical column whether they have infinite or NaN values.

is.special <- function(x)
  {
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
specialdf <- (sapply(merge_final, is.special))
specialdf <- as.data.frame(specialdf[c(4,5)])
colSums(specialdf)
## employment.ratio       population 
##                0                0

From the above it can be noted that their are no Special values in our merged data set.

Creating new variable.

As we have checked for all missing, special, infinite, NaN, and NA values, Now we will create a new variable ‘employment ratio’ which shows the count of the employed population from the variables ‘employment ratio’ and ‘population’.

final_data <- mutate(merge_final, `employed population` = round((merge_final$`employment ratio`/100)*population))
head(final_data)

Rearranging the columns for better understanding.

final_data <- final_data[, c(1,2,6,3,4,5,7)]
head(final_data)

Scan II

Checking for outliers:

Visualisation of outliers

Here we visualize the outliers for our numeric variables: population, employed population and employment ratio using boxplots.

Visualisation of outliers for employed population variable:

boxplot(final_data$`employed population` ~ final_data$IncomeGroup, main="Employed Population count by Income group", ylab = "Employed Population count", xlab = "Income group", col="skyblue",cex.axis=0.53)

Visualisation of outliers for employment ratio variable:

boxplot(final_data$`employment ratio` ~ final_data$IncomeGroup, main="Employed Population count by Income group", ylab = "Employed Population ratio to population", xlab = "Income group", col="skyblue", cex.axis=0.53)

Visualisation of outliers for population variable:

boxplot(final_data$population ~ final_data$IncomeGroup, main="Employed Population count by Income group", ylab = "Employed Population count", xlab = "Income group", col="skyblue",cex.axis=0.53)

Detecting outliers

Detection of outliers is carried out using z-scores.

Here, the locations and total count of outliers were calculated for our all 3 numeric variables : employed population, employment ratio and population. Detecting outliers for employed population variable:

z.scores1 <- final_data$`employed population`%>%  scores(type = "z")
z.scores1 %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.3257 -0.3227 -0.3137  0.0000 -0.2606  7.3324
# Finds the locations of outliers in the `employed population` variable

which( abs(z.scores1) >3 )
##  [1]   87   88  120  133  220  312  313  345  358  445  537  538  570  583  670
## [16]  762  763  795  808  895  987  988 1020 1033 1120 1212 1213 1245 1258 1345
## [31] 1437 1438 1470 1483 1570 1662 1663 1695 1708 1795 1887 1888 1920 1933 2020
## [46] 2112 2113 2145 2158 2245
# Finds the total number of outliers of `employed population`according to the z-score

length (which( abs(z.scores1) >3 ))
## [1] 50

Detecting outliers for employment ratio variable:

z.scores2 <- final_data$`employment ratio`%>%  scores(type = "z")
z.scores2 %>% summary()
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.39295 -0.59875  0.02853  0.00000  0.56396  2.73061
# Finds the locations of outliers in the `employment ratio` variable

which( abs(z.scores2) >3 )
## integer(0)
# Finds the total number of outliers of `employment ratio`according to the z-score

length (which( abs(z.scores2) >3 ))
## [1] 0

Detecting outliers for population variable:

z.scores3 <- final_data$population%>%  scores(type = "z")
z.scores3 %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.3265 -0.3231 -0.3152  0.0000 -0.2571  7.4305
# Finds the locations of outliers in the population variable

which( abs(z.scores3) >3 )
##  [1]   87   88  120  133  220  312  313  345  358  445  537  538  570  583  670
## [16]  762  763  795  808  895  987  988 1020 1033 1120 1212 1213 1245 1258 1345
## [31] 1437 1438 1470 1483 1570 1662 1663 1695 1708 1795 1887 1888 1920 1933 2020
## [46] 2112 2113 2145 2158 2245
# Finds the total number of outliers of population variable according to the z-score

length (which( abs(z.scores3) >3 ))
## [1] 50

Treating outliers

The outliers detection methods give us the suggested outliers in the data which tend to be far away from the rest of observations in the variable. For some data they might be possible anomalies and can be handled using omitting, imputing or capping, etc. But in our case the data consists of facts which should not be altered, it consists of different country’s populations and employment ratios which can be very high for some and can be low or moderate for other countries. These should not be treated as outliers. Removing, imputing, or capping these values will result in biased statistical results. Removing outliers might remove some countries, while imputing the outliers with mean or any other value will alter the results and will decrease the power of statistical analysis. Therefore we not treated these values as outliers and will not remove or impute them with any value.

But, just to show how treating the outliers can vary the results, we will perform sensitivity analysis by taking employed population variable and treating the outliers by capping. Then, we visualized before and after the effect of capping on the chosen variable.

Creating a function to cap the values outside the limits.

cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

Capping the outliers of ‘employed population’ variable.

employed_population_capped <- final_data$`employed population` %>% cap()

boxplot(employed_population_capped ~ final_data$IncomeGroup, main="Employed Population count by Income group", ylab = "Employed Population count", xlab = "Income group", col="skyblue",cex.axis=0.53)

From the above boxplot, it is visualized that capping has altered the data and some values got replaced with its nearest neighbour that are not outliers.

Now visualized the before and after effect of capping on data using Histogram.

hist(final_data$`employed population`, col = "skyblue")

hist(employed_population_capped, col="skyblue")

Detecting outlier’s locations for employed_population_capped variable using z-scores:

z.scores1 <- employed_population_capped %>%  scores(type = "z")
z.scores1 %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.5157 -0.5110 -0.4968  0.0000 -0.4133  2.0581
# Finds the locations of outliers in the `employed population` variable

which( abs(z.scores1) >3 )
## integer(0)

From the above, it can be noted that capping has altered the data to a great extent and 50 values have been capped which will vary our result and will give varied statistical results as there are biased data variations.

Same biased statistical results are obtained if omitted or imputed with mean or any other value.

We can’t afford to loose or alter data as it is sensitive in our case.

Therefore, ignoring the outliers as they can’t be considered as outliers.

Hence, ignoring the capping part and moving forward without treating the outliers.

Transform

Now taking forward the population variable and visualizing it using histogram.

hist(final_data$population, col="skyblue")

From the above histogram of population variable, it can be visualized that it right-skewed and unevenly distributed.

Now applying log10 transformation to population variable to correct the skewness.

final_data$population_log <- log10(final_data$population)
hist(final_data$population_log, col = "skyblue")

After applying the log10 transformation, the skewness of the ‘population’ variable is corrected which makes the data more symmetric and suitable for further analysis.