Required packages
Below packages are installed to enhance and add features to the existing base R functionalities
library(readr)
library(dplyr)
library(tidyr)
library(Hmisc)
library(outliers)
library(forecast)
Executive Summary
Aim:
The aim of the below preprocessing steps is to create a clean, understandable dataset from the raw data available.
Dataset:
Two datasets, one containing each country’s economic status and the other containing the details of their population’s employment ratio were taken, analyzed, cleaned and preprocessed.
Procedure:
- The two raw datasets were subsetted, renamed and joined together to form a single dataset with only the relevant information needed for analysis
- Depending on the types of variables, the necessary type conversions were applied
- The dataset was then tidied into long format and type conversions applied wherever necessary
- The tidied dataset was then checked for missing values and necessary techniques were applied to handle them
- A new variable was computed from the existing values to calculate the growth of each country in 4 years. This step involved converting the long format to wide format(for ease of calculation), calculating the growth and converting back to long format, with necessary data type conversions along the way
- The numeric variables were then checked for outliers and analysed
- The distribution of data was visualized and necessary transformations applied to render a normlly distributed data
Data
The dataset contains the details of Employment to population ratio of each country. The ages considered for this analysis is in the range of 15-24 years, which is generally considered as the youth population. The second dataset that is considered for the analysis is the metadata that contains all the information of each country, such as Income group, currency, latest population, industrial and trade data, accounting concept, etc. These datasets has been taken from World Banks’s open source data(https://data.worldbank.org).
#Importing dataset
rawDataset <- read_csv("Employment_stats.csv")
metadata <- read_csv("Country_Metadata.csv")
head(rawDataset)
head(metadata)
For the purpose of preparing the data for analysis, the two datasets are merged as one with only the required variables. This is done step by step as below:
Since the metadata contains all the information about each country, we are going to select only the variables we need(Country code/Code and Income Group) and store it in ‘IncomeGroup’ data frame
Since the country code is the variable we are to use for joining the metadata with the rawDataset, we need to ensure that te variable names match. We thus rename the variables of the ‘IncomeGroup’ dataset
#Selecting only the variables needed
IncomeGroup <- metadata %>% select(Code, `Income Group`)
#Renaming 'Code' as 'Country_Code' for merging
IncomeGroup <- rename(IncomeGroup,Country_Code=Code,Income_group='Income Group')
head(IncomeGroup)
- The rawDataset contains some unwanted information (Series Code and Series Name) that are repetitive and insignificant for our analysis. So the rawDataset is subsetted with only the variables needed, and renamed appropriately for ease of understanding
#subsetting the dataset to remove the unwanted variables 'SeriesCode' and 'seriesName'
dataset_subset <- rawDataset %>% select(-c('Series Code','Series Name'))
dataset_subset <- rename(dataset_subset, Country_Name='Country Name', Country_Code='Country Code')
names(dataset_subset)[3] <- 2014
names(dataset_subset)[4] <- 2015
names(dataset_subset)[5] <- 2016
names(dataset_subset)[6] <- 2017
head(dataset_subset)
- The two datasets are now merged using left_join() function, by the common Country_Code variable
#Joining 'dataset_subset' and 'IncomeGroup'
dataset_join <- left_join(dataset_subset, IncomeGroup, by="Country_Code" )
head(dataset_join)
NA
The variables in the dataset after joining are:
colnames(dataset_join)
[1] "Country_Name" "Country_Code" "2014" "2015"
[5] "2016" "2017" "Income_group"
Country_Name: Names of all countries, arranged alphabetically
Country_Code: A unique code for each country, used for merging the two datasets
2014-2017: Employment to population ratio of each country through the years 2014 - 2017
Income_group: Economy status of each country
Understand
Using str() function, the structure of the joined dataset(dataset_join) can be seen as:
str(dataset_join)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 268 obs. of 7 variables:
$ Country_Name: chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
$ Country_Code: chr "AFG" "ALB" "DZA" "ASM" ...
$ 2014 : num NA 17.7 19.4 NA NA ...
$ 2015 : num NA 18.9 17.7 NA NA ...
$ 2016 : num NA NA 18.3 NA NA ...
$ 2017 : num 32.4 21.6 19.5 NA NA ...
$ Income_group: chr "Low income" "Upper middle income" "Upper middle income" "Upper middle income" ...
From the structure, it can be seen that the data types of ‘Country_Name’, ‘Country_Code’ and ‘Income_group’ is character and the rest are numeric. But since the ‘Income_Group’ variable contains categorical data, it is converted to a factored variable using the factor() function and levels defined.
#Converting data type of income group from chr to factor
dataset_join$Income_group <- factor(dataset_join$Income_group, levels = c("Low income", "Lower middle income", "Upper middle income", "High income"), ordered = TRUE)
levels(dataset_join$Income_group)
[1] "Low income" "Lower middle income" "Upper middle income"
[4] "High income"
str(dataset_join)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 268 obs. of 7 variables:
$ Country_Name: chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
$ Country_Code: chr "AFG" "ALB" "DZA" "ASM" ...
$ 2014 : num NA 17.7 19.4 NA NA ...
$ 2015 : num NA 18.9 17.7 NA NA ...
$ 2016 : num NA NA 18.3 NA NA ...
$ 2017 : num 32.4 21.6 19.5 NA NA ...
$ Income_group: Ord.factor w/ 4 levels "Low income"<"Lower middle income"<..: 1 3 3 3 4 2 4 NA 3 3 ...
Tidy & Manipulate Data I
In the ‘dataset_join’ data frame, the column names 2014-2017 are values of year and not variable names. This calls for a transformation of the wide format to long format. gather() function is thus used to create the ‘year’ and ‘percentage’ variables. The arguments ‘key’ and ‘value’ correspond to the name of the variable whose values form the column names and the name of the variable whose values are spread over the cells.
Since the ‘Year’ variable is of character data type, it is converted into an integer using the as.integer() function.
#wide to long format - tidying data
dataset_join <- dataset_join %>% gather(key = "year", value= "percentage", 3:6 )
#converting data type of 'year' variable from chr to int
dataset_join$year <- as.integer(dataset_join$year)
head(dataset_join)
NA
Scan I
With the tidied dataset in long format, it is now checked for missing values in all variables. colSums() gives the total number of values missing in each variable.
#checking NA values for all variables
colSums(is.na(dataset_join))
Country_Name Country_Code Income_group year percentage
20 20 204 0 574
From above, it can be seen that there are 20 observartions with missing ‘Country_Name’ and ‘Country_Code’. Since the data relevent to our analysis becomes invalid without the ‘Country_Name’ or ‘Country_Code’, it needs to be excluded. But before proceeding, to make sure excluding these missing values do not create an impact on the dataset, the % of missing values is calculated for ‘Country_Name’ and ‘Country_Code’ variables. The total number of values in each of the variables can be obtained with length() function as:
#finding length to calculate the % of missing values in country
length(dataset_join$Country_Name)
[1] 1072
length(dataset_join$Country_Code)
[1] 1072
% of missing values = (No. of missing values / Total number of values) * 100 = (20/1072) * 100 = 1.86%
Since the percentage of missing values in country name and code is less than 5%, we can now omit the observations with missing values.
dataset_join <- dataset_join[-which(is.na(dataset_join$Country_Name)),]
colSums(is.na(dataset_join))
Country_Name Country_Code Income_group year percentage
0 0 184 0 554
From above, it can be seen that the number of missing values is now 0 for the variables ‘Country_Name’ and ‘Country_Code’.
The variable ‘Income_group’ has 184 missing values and is a categorical variable. ‘getmode’ is a custom function to find the most common categorical value to be replaced in place of missing values.
#function to find the most common category
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
income_mode <- getmode(dataset_join$Income_group)
#Replacing the missing values with the most common category
dataset_join$Income_group[which(is.na(dataset_join$Income_group))] <- income_mode
colSums(is.na(dataset_join))
Country_Name Country_Code Income_group year percentage
0 0 0 0 554
head(dataset_join)
Now, the ‘Percentage’ variable has 554 missing values. Since the dataset is in long format, for each country code, we have 4 values, corresponding to the years 2014, 2015, 2016 and 2017. In order to impute the missing percentage values with their means, we group the dataset by country code and get the mean for each country and impute them in the missing values.
dataset_impute <- dataset_join %>% group_by(Country_Code) %>% mutate(percentage = ifelse(is.na(percentage), mean(percentage, na.rm = TRUE), percentage))
colSums(is.na(dataset_impute))
Country_Name Country_Code Income_group year percentage
0 0 0 0 392
There are now 392 missing values in the percentage variable, which are actually special values ‘NaN’, caused because of unavailble means.
sum(is.nan(dataset_impute$percentage))
[1] 392
This means that these countries don’t have any percentage values recorded through the years 2014-2017. Hence there is no mean value to impute the missing values. Since these values are unavailable in the dataset, we omit them.
dataset_impute2 <- na.omit(dataset_impute)
colSums(is.na(dataset_impute2))
Country_Name Country_Code Income_group year percentage
0 0 0 0 0
Tidy & Manipulate Data II
To find the growth in employment to population ratio between the years 2014 and 2017, we convert the dataset from long format to wide format for ease of calculation and compute the growth(values of 2017 - 2014).
#converting the year variable into wide format in order to calculate growth
wideFormat <- spread(dataset_impute2, key = year, value = percentage)
wideFormat <- wideFormat %>% mutate(growth=(`2017`-`2014`))
head(wideFormat)
Upon calculating the growth, the dataset is tidied back up into ‘longFormat’ and year variable’s datatype is converted from character to integer.
#Converting back to long format
longFormat <- wideFormat %>% gather(key = "year", value= "percentage", 4:7 )
longFormat$year <- as.integer(longFormat$year)
head(longFormat)
NA
Scan II
From above, it can be seen that ‘growth’, ‘year’ and ‘percentage’ are the only numeric variables that needs to be scanned for outliers. Since the variable ‘year’ was created as part of the tidy step, it only has four values, 2014-2017. It therefore doesn’t require any checks for outliers. Proceeding to check for outliers in the variables ‘growth’ and ‘year’ using the z-score method from outliers package.
#checking for outliers in 'growth' variable
z.scoreGrowth <- longFormat$growth %>% scores(type = "z")
length(which( abs(z.scoreGrowth) >3 ))
[1] 4
#checking for outliers in 'percentage' variable
z.scorePercentage <- longFormat$percentage %>% scores(type = "z")
length(which( abs(z.scorePercentage) >3 ))
[1] 1
It can be seen that there are 4 outliers in ‘growth’ variable and 1 outlier in ‘percentage’ variable. These outlier values belong to a country’s growth and percentage of employment to population ratio respectively. Since these are observations of a country’s performance, they cannot be omitted, deleted, imputed or capped and are treated as valid data and not outliers. In order to reduce the variation caused by these values in the ‘percentage’ variable, we apply the below explained transformations.
---
title: "MATH2349 Semester 2, 2019"
author: "Syeda Hafsa Mujeeb (s3796697), 
         Vaishnaavee Rajaraman (s3813544), 
         Vidavaluru Sai Haneesha (s3799365)"
subtitle: Assignment 3
output:
  html_notebook: default
---

## Required packages 

Below packages are installed to enhance and add features to the existing base R functionalities

```{r}
library(readr)
library(dplyr)
library(tidyr)
library(Hmisc)
library(outliers)
library(forecast)
```


## Executive Summary 

#### Aim:
The aim of the below preprocessing steps is to create a clean, understandable dataset from the raw data available.

#### Dataset:
Two datasets, one containing each country's economic status and the other containing the details of their population's employment ratio were taken, analyzed, cleaned and preprocessed. 

#### Procedure:
* The two raw datasets were subsetted, renamed and joined together to form a single dataset with only the relevant information needed for analysis
* Depending on the types of variables, the necessary type conversions were applied
* The dataset was then tidied into long format and type conversions applied wherever necessary
* The tidied dataset was then checked for missing values and necessary techniques were applied to handle them
* A new variable was computed from the existing values to calculate the growth of each country in 4 years. This step involved converting the long format to wide format(for ease of calculation), calculating the growth and converting back to long format, with necessary data type conversions along the way
* The numeric variables were then checked for outliers and analysed
* The distribution of data was visualized and necessary transformations applied to render a normlly distributed data


## Data

The dataset contains the details of Employment to population ratio of each country. The ages considered for this analysis is in the range of 15-24 years, which is generally considered as the youth population. The second dataset that is considered for the analysis is the metadata that contains all the information of each country, such as Income group, currency, latest population, industrial and trade data, accounting concept, etc. These datasets has been taken from World Banks's open source data(https://data.worldbank.org).

```{r}
#Importing dataset
rawDataset <- read_csv("Employment_stats.csv")
metadata <- read_csv("Country_Metadata.csv")
```
```{r}
head(rawDataset)
head(metadata)
```

For the purpose of preparing the data for analysis, the two datasets are merged as one with only the required variables. This is done step by step as below:

1. Since the metadata contains all the information about each country, we are going to select only the variables we need(Country code/Code and Income Group) and store it in 'IncomeGroup' data frame

2. Since the country code is the variable we are to use for joining the metadata with the rawDataset, we need to ensure that te variable names match. We thus rename the variables of the 'IncomeGroup' dataset

```{r}
#Selecting only the variables needed
IncomeGroup <- metadata %>% select(Code, `Income Group`)

#Renaming 'Code' as 'Country_Code' for merging
IncomeGroup <- rename(IncomeGroup,Country_Code=Code,Income_group='Income Group')

head(IncomeGroup)
```

3. The rawDataset contains some unwanted information (Series Code and Series Name) that are repetitive and insignificant for our analysis. So the rawDataset is subsetted with only the variables needed, and renamed appropriately for ease of understanding
```{r}
#subsetting the dataset to remove the unwanted variables 'SeriesCode' and 'seriesName'
dataset_subset <- rawDataset %>% select(-c('Series Code','Series Name'))

dataset_subset <- rename(dataset_subset, Country_Name='Country Name', Country_Code='Country Code')
names(dataset_subset)[3] <- 2014
names(dataset_subset)[4] <- 2015
names(dataset_subset)[5] <- 2016
names(dataset_subset)[6] <- 2017
head(dataset_subset)
```

4. The two datasets are now merged using left_join() function, by the common Country_Code variable
```{r}
#Joining 'dataset_subset' and 'IncomeGroup'
dataset_join <- left_join(dataset_subset, IncomeGroup, by="Country_Code" )
head(dataset_join)

```

The variables in the dataset after joining are:

```{r}
colnames(dataset_join)
```
Country_Name: Names of all countries, arranged alphabetically

Country_Code: A unique code for each country, used for merging the two datasets

2014-2017: Employment to population ratio of each country through the years 2014 - 2017

Income_group: Economy status of each country


## Understand 

Using str() function, the structure of the joined dataset(dataset_join) can be seen as:
```{r}
str(dataset_join)
```

From the structure, it can be seen that the data types of 'Country_Name', 'Country_Code' and 'Income_group' is character and the rest are numeric. But since the 'Income_Group' variable contains categorical data, it is converted to a factored variable using the factor() function and levels defined.

```{r}
#Converting data type of income group from chr to factor
dataset_join$Income_group <- factor(dataset_join$Income_group, levels = c("Low income", "Lower middle income", "Upper middle income", "High income"), ordered = TRUE)

levels(dataset_join$Income_group)
str(dataset_join)
```


##	Tidy & Manipulate Data I 

In the 'dataset_join' data frame, the column names 2014-2017 are values of year and not variable names. This calls for a transformation of the wide format to long format. gather() function is thus used to create the 'year' and 'percentage' variables. The arguments 'key' and 'value' correspond to the name of the variable whose values form the column names and the name of the variable whose values are spread over the cells.

Since the 'Year' variable is of character data type, it is converted into an integer using the as.integer() function.

```{r}
#wide to long format - tidying data
dataset_join <- dataset_join %>% gather(key = "year", value= "percentage", 3:6 )

#converting data type of 'year' variable from chr to int
dataset_join$year <- as.integer(dataset_join$year)
head(dataset_join)

```


##	Scan I 

With the tidied dataset in long format, it is now checked for missing values in all variables. colSums() gives the total number of values missing in each variable.

```{r}
#checking NA values for all variables
colSums(is.na(dataset_join))
```

From above, it can be seen that there are 20 observartions with missing 'Country_Name' and 'Country_Code'. Since the data relevent to our analysis becomes invalid without the 'Country_Name' or 'Country_Code', it needs to be excluded. But before proceeding, to make sure excluding these missing values do not create an impact on the dataset, the % of missing values is calculated for 'Country_Name' and 'Country_Code' variables. The total number of values in each of the variables can be obtained with length() function as:
```{r}
#finding length to calculate the % of missing values in country
length(dataset_join$Country_Name)
length(dataset_join$Country_Code)
```
% of missing values = (No. of missing values / Total number of values) * 100 
= (20/1072) * 100 = 1.86%

Since the percentage of missing values in country name and code is less than 5%, we can now omit the observations with missing values.

```{r}
dataset_join <- dataset_join[-which(is.na(dataset_join$Country_Name)),]
colSums(is.na(dataset_join))
```
From above, it can be seen that the number of missing values is now 0 for the variables 'Country_Name' and 'Country_Code'. 

The variable 'Income_group' has 184 missing values and is a categorical variable. 'getmode' is a custom function to find the most common categorical value to be replaced in place of missing values. 
```{r}
#function to find the most common category
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
income_mode <- getmode(dataset_join$Income_group)

#Replacing the missing values with the most common category
dataset_join$Income_group[which(is.na(dataset_join$Income_group))] <- income_mode
colSums(is.na(dataset_join))
head(dataset_join)
```

Now, the 'Percentage' variable has 554 missing values. Since the dataset is in long format, for each country code, we have 4 values, corresponding to the years 2014, 2015, 2016 and 2017. In order to impute the missing percentage values with their means, we group the dataset by country code and get the mean for each country and impute them in the missing values.

```{r}
dataset_impute <- dataset_join %>% group_by(Country_Code) %>%  mutate(percentage = ifelse(is.na(percentage), mean(percentage, na.rm = TRUE), percentage))

colSums(is.na(dataset_impute))
```

There are now 392 missing values in the percentage variable, which are actually special values 'NaN', caused because of unavailble means.

```{r}
sum(is.nan(dataset_impute$percentage))
```

This means that these  countries don't have any percentage values recorded through the years 2014-2017. Hence there is no mean value to impute the missing values. Since these values are unavailable in the dataset, we omit them.

```{r}
dataset_impute2 <- na.omit(dataset_impute)
colSums(is.na(dataset_impute2))
```


##	Tidy & Manipulate Data II 

To find the growth in employment to population ratio between the years 2014 and 2017, we convert the dataset from long format to wide format for ease of calculation and compute the growth(values of 2017 - 2014).

```{r}
#converting the year variable into wide format in order to calculate growth
wideFormat <- spread(dataset_impute2, key = year, value = percentage)
wideFormat <- wideFormat %>% mutate(growth=(`2017`-`2014`))
head(wideFormat)
```

Upon calculating the growth, the dataset is tidied back up into 'longFormat' and year variable's datatype is converted from character to integer.

```{r}
#Converting back to long format
longFormat <- wideFormat %>% gather(key = "year", value= "percentage", 4:7 )
longFormat$year <- as.integer(longFormat$year)
head(longFormat)

```


##	Scan II

From above, it can be seen that 'growth', 'year' and 'percentage' are the only numeric variables that needs to be scanned for outliers. Since the variable 'year' was created as part of the tidy step, it only has four values, 2014-2017. It therefore doesn't require any checks for outliers. Proceeding to check for outliers in  the variables 'growth' and 'year' using the z-score method from outliers package. 

```{r}
#checking for outliers in 'growth' variable
z.scoreGrowth <- longFormat$growth %>%  scores(type = "z")
length(which( abs(z.scoreGrowth) >3 ))

#checking for outliers in 'percentage' variable
z.scorePercentage <- longFormat$percentage %>%  scores(type = "z")
length(which( abs(z.scorePercentage) >3 ))
```
It can be seen that there are 4 outliers in 'growth' variable and 1 outlier in 'percentage' variable. These outlier values belong to a country's growth and percentage of employment to population ratio respectively. Since these are observations of a country's performance, they cannot be omitted, deleted, imputed or capped and are treated as valid data and not outliers. In order to reduce the variation caused by these values in the 'percentage' variable, we apply the below explained transformations.     


##	Transform 

The histogram of 'percentage' variable is plotted to analyse the distribution.

```{r}
hist(longFormat$percentage)
```

It can be seen that the distribution is not normal. Since it is right skewed, we consider taking square root(using sqrt() function) and logarithm(using log10() function) to see which transformation works best.

```{r}
percentage_log <- log10(longFormat$percentage)
hist(percentage_log)

percentage_sqrt <- sqrt(longFormat$percentage)
hist(percentage_sqrt)
```
From the above histograms, we can see that the square root transformation has reduced the skewness and has improved the symmetry of the 'percentage' distribution.

<br>
<br>

