Required packages

library(dplyr) #Library for manipulating data
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr) #Library for tidy data
library(readr) # Library for importing data
library(readxl) #Library for importing excel files
library(ggplot2) #library for making plots
library(outliers) #Library for detecting outliers
library(forecast) #Library for BoxCox
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Executive Summary

Data preprocessing is a critical step in any data science study. In this presentation, we will demonstrate the various steps needed in this process. For this paper we will demonstrate using World Bank Data that is untidy and a second dataset containing regional and income classifications of various countries. First is to understand the dataset about to be used. Our first dataset was in the wrong format and very untidy. We first changed it into a tidy format using gather and spread. We then merge it the second dataset using left_join. We will then understand the various variables of the second dataset using str(). We then convert the variables into their proper data types. Next we demonstrate R’s ability to add a new variable to the data by using the mutate() functions. We then check all these variables for missing values, special characters and numbers. The next step is to check for outliers. One of the simplest ways of doing this is to use the boxplot. All numerical variables are given boxplots to examine visually for outliers. Once outliers are found, a decision needs to be made on how to handle them. For this world bank dataset, all the outliers are very important observations and not a result of data entry error. Thus it was decided to leave them as they as they provide valuable information. However a short demonstration was given to see how outliers can be handled. Finally, data transformation was conducted on one variable to give it a normal distribution. Normal distributions are very important for statistical research. These are the basic steps in data preprocessing which are necessary before any data analysis can be done.

Tidy & Manipulate Data I

This first dataset is Untidy. Before we can even discuss its variables, we need to manipulate the data into a tidy form. In its current format, the columns are countries and the rows are the variables. Tidy data must have observations at the rows, in this case, countries must be at rows. In addition, variables should be in columns. We will also only use a certain subset of this data since the other variable only containing code information which is not needed. To achieve this, will use select(). Column names are also replaced with simpler names for easier calling and to match the key to be used on a later join. After this is completed, the spread and gather functions will be used to make the data tidy. First gather() will be used to place all the countries at the rows. Then spread() will be used to spread the variables around into their correct position. We will view the new tidy format using head().

#Read the first Dataset
countryWB<- read_csv('dataworld.csv')
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Time = col_character(),
##   `Time Code` = col_character(),
##   `Series Name` = col_character(),
##   `Series Code` = col_character()
## )
## See spec(...) for full column specifications.
head(countryWB)
#Get the subset of only the first four rows to remove additional information at the bottom of the data
countryWB<-countryWB[0:6,]

#Change the Column names for easier calling
colnames(countryWB)<-c('Year','Code','Variable','TableCode', 'Australia', 'Austria', 'Belgium','Brazil','Canada','China','Colombia','Denmark','France','Germany','Iceland','India','Indonesia','Israel','Italy','Japan','Jordan','Kenya','Malaysia','Mexico','Netherlands','New Zealand','Pakistan','Philippines','Poland','Spain','Sri Lanka','Thailand','United Arab Emirates','United States')

#Subset the part of the data that will be used
newpop<- countryWB%>% select(Variable, Australia:`United States`)

#Use Gather() to collect the countries and place them as observations
newpop<- newpop %>% gather('Australia', 'Austria', 'Belgium','Brazil','Canada','China','Colombia','Denmark','France','Germany','Iceland','India','Indonesia','Israel','Italy','Japan','Jordan','Kenya','Malaysia','Mexico','Netherlands','New Zealand','Pakistan','Philippines','Poland','Spain','Sri Lanka','Thailand','United Arab Emirates','United States',key = "country", value = "population")

#Use Spread() to place the variable in columns
newpop<- newpop %>% spread(key = Variable, value = population )

head(newpop)

Data

There are two datasets to be used in this report.

The first dataset comes from the world bank. It contains World Development indicators for various countries like GDP, total population size, crude birth rate and crude death rate for the year 2018. It shows these indicators for a randomly sampled number of countries. This data was untidy with the variable in the rows and the observations/countries in the column. They were tidied first before discussing them here.

The first dataset was read using read_csv. Then we subset only the first six rows of the dataset since the csv contains extra information at the bottom that is not needed. Column names(Country names) were changed using the colnames() function as we will be using this as the key for the merging.

The Variables are as follows

Country - Describes the name of the country

Birth Rate - Describes the Birth Rate of the country per 1000 people

Death Rate - Describes the Death Rate of the country per 1000 people

Population - Describes the total population of the country

Consumer Prices Index - Describes the increase in prices compared to 2010 of goods

GDP growth - Describes the growth of GDP from the previous year

GDP, ppp - Describes the Global Domestic Products of country which is one measure of total economic activity of a country

The second dataset comes from the world bank helpdesk. It is a table containing the various classifications countries are being classified as. To create more meaningful data, the world bank has created some aggregate classes for countries. For example, there could be some value in comparing behaviors between high income countries and low income countries. For this paper we will be focusing on two classifications: the region where a country belongs and the income level the country is classified under.

The second dataset was read using read_excel file. The first five rows were skipped as they did not contain important values. Class was also replaced by a subset of only the relevant variables. The other variables will not be useful and as such were not included in the selection. This was achieve using the select function.

The Variables are as follows

Economy- Describes the name of the country

Region - Describes the region where the country belongs

Income Group - Describes the Income Group the country belong to

To join these two datasets, left_join will be used with country as the key. This new dataset will contain all the observations of the first dataset. It will look for the corresponding key in the second dataset, and add the variables based on the match. In this manner, we can get more information on the first dataset. The new dataset was named newset.

To display the old datasets and the new one, we will be using the head() function.

#View the first Dataset
head(newpop)
class <- read_excel("CLASS.xls", skip = 5)
## New names:
## * x -> x...1
## * x -> x...2
## * x -> x...3
## * x -> x...4
## * x -> x...5
## * ...
colnames(class) <- c('','','Economy','Code','','Region','Income Group','Lending Category','Other')
class<- select(class,Economy, Region,`Income Group`)
head(class)
newset<-left_join(newpop,class, c("country"="Economy"))
head(newset)

Understand

Before doing anything more with the dataset, it is important to understand the various features. To achieve this, we will be using the dim() to check the dimension and str() to check the attributes. The data has 30 observations and 9 variables.

dim(newset)
## [1] 30  9
str(newset)
## tibble [30 × 9] (S3: tbl_df/tbl/data.frame)
##  $ country                             : chr [1:30] "Australia" "Austria" "Belgium" "Brazil" ...
##  $ Birth rate, crude (per 1,000 people): num [1:30] 12.6 9.7 10.3 13.9 10.1 ...
##  $ Consumer price index (2010 = 100)   : num [1:30] 118 116 115 161 115 ...
##  $ Death rate, crude (per 1,000 people): num [1:30] 6.3 9.5 10.7 6.45 7.7 ...
##  $ GDP growth (annual %)               : num [1:30] 2.94 2.42 1.46 1.32 1.9 ...
##  $ GDP, PPP (current international $)  : num [1:30] 1.28e+12 5.03e+11 5.97e+11 3.13e+12 1.85e+12 ...
##  $ Population, total                   : num [1:30] 2.50e+07 8.84e+06 1.14e+07 2.09e+08 3.71e+07 ...
##  $ Region                              : chr [1:30] "East Asia & Pacific" "Europe & Central Asia" "Europe & Central Asia" "Latin America & Caribbean" ...
##  $ Income Group                        : chr [1:30] "High income" "High income" "High income" "Upper middle income" ...

The first seven variables require no change. They are in the correct format. Region and Income group however should be converted into factors as they categorical classifications. In future studies, it will be useful to treat them as factors. This can be achieved through the factor() functions. In addition, it will also be helpful to order the levels in the Income Group variable. This will be achieved by adding the code ordered=TRUE). We check if the factor conversion is complete by using is.factor.

newset$`Income Group`<- factor(newset$`Income Group`,
                          levels = c("High income","Upper middle income","Lower middle income","Low income"),ordered=TRUE)

newset$Region<- factor(newset$Region,
                          levels = c("East Asia & Pacific","Europe & Central Asia", "Latin America & Caribbean","Middle East & North Africa", "North America", "South Asia", "Sub-Saharan Africa"))

is.factor(newset$`Income Group`)
## [1] TRUE
is.factor(newset$Region)
## [1] TRUE
str(newset)
## tibble [30 × 9] (S3: tbl_df/tbl/data.frame)
##  $ country                             : chr [1:30] "Australia" "Austria" "Belgium" "Brazil" ...
##  $ Birth rate, crude (per 1,000 people): num [1:30] 12.6 9.7 10.3 13.9 10.1 ...
##  $ Consumer price index (2010 = 100)   : num [1:30] 118 116 115 161 115 ...
##  $ Death rate, crude (per 1,000 people): num [1:30] 6.3 9.5 10.7 6.45 7.7 ...
##  $ GDP growth (annual %)               : num [1:30] 2.94 2.42 1.46 1.32 1.9 ...
##  $ GDP, PPP (current international $)  : num [1:30] 1.28e+12 5.03e+11 5.97e+11 3.13e+12 1.85e+12 ...
##  $ Population, total                   : num [1:30] 2.50e+07 8.84e+06 1.14e+07 2.09e+08 3.71e+07 ...
##  $ Region                              : Factor w/ 7 levels "East Asia & Pacific",..: 1 2 2 3 5 1 3 2 2 2 ...
##  $ Income Group                        : Ord.factor w/ 4 levels "High income"<..: 1 1 1 2 1 2 2 1 1 1 ...

Tidy & Manipulate Data II

In many situations, additional data may be needed that was not present in the dataset but is available after some manipulation of the existing variables. To demonstrate this, we will used the mutate function to add the GDP per capita column in the new dataset. GDP per capita is taken by dividing GDP by Total population. We will use the mutate(). We will also use head() to view the new dataset.

newset<- mutate(newset,`GDP per Capita` = `GDP, PPP (current international $)`/`Population, total`)

head(newset)

Scan I

Scanning for missing values and special values is important in the preparation of any data. These kinds of values can alter and skew future data modeling. It is critical to address them in the data preparation step. To scan for missing values we use the is.na function. To scan the entire data frame, we can use a combination sum() and is.na(). On some occasions, it is also helpful to colsum() and is.na() to identify which column contain the missing value. In this case, there are no missing values so no further action is needed.

To scan for the special characters, we use a special function calls is.special. This function combines is.infinite and is.nan which checks for special numerical values. To apply this to the entire dataframe, we have to use the sapply function. There are no special values so no further action is needed.

sum(is.na(newset))
## [1] 0
colSums(is.na(newset))
##                              country Birth rate, crude (per 1,000 people) 
##                                    0                                    0 
##    Consumer price index (2010 = 100) Death rate, crude (per 1,000 people) 
##                                    0                                    0 
##                GDP growth (annual %)   GDP, PPP (current international $) 
##                                    0                                    0 
##                    Population, total                               Region 
##                                    0                                    0 
##                         Income Group                       GDP per Capita 
##                                    0                                    0
is.special <- function(x){
  if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
sapply(newset, function(x) sum( is.special(x) ))
##                              country Birth rate, crude (per 1,000 people) 
##                                    0                                    0 
##    Consumer price index (2010 = 100) Death rate, crude (per 1,000 people) 
##                                    0                                    0 
##                GDP growth (annual %)   GDP, PPP (current international $) 
##                                    0                                    0 
##                    Population, total                               Region 
##                                    0                                    0 
##                         Income Group                       GDP per Capita 
##                                    0                                    0

Scan II

Scanning for outliers is very important. To do this we must first generate boxplots. From visual inspection, it can be seen that population, Crude Birth Rate, Consumer Price Index and GDP both have very significant outliers. However, upon further inspection of the data, it was decided it was best to keep them. These outliers were not created due to data errors or measurement errors. They are observations that still provide valuable information. For this case, it will be more beneficial to leave them in the data.

However, this paper will demonstrate how to deal with outliers by using the Crude Birth Rate as an Example. This data however will not be kept and will be shown only through a dummy boxplot. Outliers can be treated using the capping method as demonstrated below. Capping is a useful method when outlier values are limited to the nearest value that are not outliers. First we create a cap function() to determine the different quartiles and replace any outliers with its nearest fence. Then we show the boxplot to reflect the change. This is only one method of dealing with outliers. From the generated boxplot, it can be see the outlier has been removed. In this dataset however, outliers should not be removed. This section is only a demonstration of a possible method to deal with outliers.

boxplot(newset$`Population, total`/1000000, main="Population", ylab="Per million")

boxplot(newset$`Birth rate, crude (per 1,000 people)`, main="Crude Birth Rate", ylab="Per 1000 people")

boxplot(newset$`Death rate, crude (per 1,000 people)`, main="Crude Death Rate", ylab="Per 1000 people")

boxplot(newset$`GDP growth (annual %)`, main="GDP growth annual", ylab="Percentage")

boxplot(newset$`GDP, PPP (current international $)`/1000000000, main="GDP", ylab="USD per billion")

boxplot(newset$`Consumer price index (2010 = 100)`, main="Consumer Price Index", ylab="CPI")

boxplot(newset$`GDP per Capita`, main="GDP per Capita")

cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

capping <- newset$`Birth rate, crude (per 1,000 people)` %>% cap()
boxplot(capping, main="Crude Birth Rate w/o Outliers", ylab="Per 1000 people")

Transform

In many data analysis, a normal distribution is desired. There are many properties to normal distribution that can help in analysis. To this end, it can useful to transform data into a normal distribution. This can be done through the application of various transformations like log10(), log(), sqrt() and reciprocal transformation. These transformation are created by applying numerical the numerical operations on the data with the goal of creating a normal distribution. We will demonstrate this principle below using the GDP annual growth function. From the various generated histograms, it seems that the sqrt() transformations was the most effective.

x<-newset$`GDP growth (annual %)`
hist(x)

log <- log10(x)
hist(log)

ln <- log(x)
hist(ln)

hist(sqrt(x))

hist(x^2)

hist(1/x)

References

World Bank Group, 2020, World Development Indicators, World Development Indicators, World Bank Group viewed 5 June 2020 https://databank.worldbank.org/source/world-development-indicators

World Bank Group, 2020, World Bank Country and Lending Groups,current classification by income in XLS format, data file, World Bank Group viewed on 5 Junes 2020 https://datahelpdesk.worldbank.org/knowledgebase/articles/906519-world-bank-country-and-lending-groups