library(readr)
library(magrittr)
library(tidyr)
library(dplyr)
Multiple preprocessing techniques were used in this report to better understand and visualize two World Bank data sets. Both data sets were considered untidy and preprocessing was required to help conform with tidy data principles. The data sets were then merged in a natural join and variable data types were checked for accuracy. The resulting data frame contained character, numeric and factor data types. A new variable was then created using variables from each data set. The data was then scanned for missing values, errors and outliers. Finally, a transformation was done on one of the variables to reduce its skewness.
Whilst the original data sets were quite large, containing thousands of values, I have opted to select only the variables most relevant to this report for efficiency reasons.
Two open source data sets were acquired from the World Bank using different databases. The first data set was imported from the World Development Indicators database and it lists development indicators by country for the last 25 years (1995 to 2019). Some examples of variables include “access to electricity”, “individuals using the internet”, “coverage of social protection and labor programs” etc. The second data set was imported from the Population Statistics database and it lists basic population data sorted by country for the same time period. Some examples of the variables include “male population”, “female population”, “total population” etc.
Both sets contain a ‘Country Name’ variable which will be used to merge later in the report. More variable descriptions will be provided once the data set has been tidied (please see the bottom of the ‘Understand’ section). In the interests of brevity I have opted to shorten the data set to the last 5 years as it does not change the wrangling techniques used later on.
#Import then select data from 2015 to 2019
pop <- read_csv("~/Documents/Analytics/Data Wrangling (MATH2349)/Assignment 2/population data.csv") %>% select(c(1:4, 25:29))
Parsed with column specification:
cols(
.default = col_character()
)
See spec(...) for full column specifications.
head(pop)
dev <- read_csv("~/Documents/Analytics/Data Wrangling (MATH2349)/Assignment 2/development indicators.csv") %>% select(c(1:4, 25:29))
Parsed with column specification:
cols(
.default = col_character()
)
See spec(...) for full column specifications.
head(dev)
NA
Both data sets do not conform to tidy data principles as every single year has its own separate column despite being a variable. Also, the column “Series Name” has multiple variables. In this section I will first combine years with the gather() function and then split the “Series Name” column into separate columns using the spread() function. The data sets are then merged using a natural join (they will be merged using the variables “Country Name” and “Year”)
#Starting with the population data set, gather all years into a single column and drop the redundant variables "Country Code" and "Series Code"
pop_gather <- pop %>% gather(5:9, key = "Year", value = "Indicator") %>% select(c(1, 3, 5, 6))
#Split the "Series Name" column into multiple columns using spread()
pop_spread <- spread(pop_gather, key = "Series Name", value = "Indicator")
#Do the same for the Development Indicators data set
dev_gather<- dev %>% gather(5:9, key = "Year", value = "Indicator") %>% select(c(1, 3, 5, 6))
dev_spread <- spread(dev_gather, key = "Series Name", value = "Indicator")
#Both data sets are now considered tidy - all variables have their own columns and each observation has its own row.
head(pop_spread)
head(dev_spread)
#The data sets are then merged using a natural join
pop_dev <- pop_spread %>% left_join(dev_spread)
Joining, by = c("Country Name", "Year")
head(pop_dev)
Here we check the data types of all the variables.
#Summary of the data types within pop_dev
str(pop_dev)
tibble [1,320 × 19] (S3: tbl_df/tbl/data.frame)
$ Country Name : chr [1:1320] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Year : chr [1:1320] "2015 [YR2015]" "2016 [YR2016]" "2017 [YR2017]" "2018 [YR2018]" ...
$ Population, female : chr [1:1320] "16727437" "17196083" "17644266" "18079105" ...
$ Population, female (% of total population) : chr [1:1320] "48.60704937" "48.59966761" "48.61161572" "48.6358467" ...
$ Population, male : chr [1:1320] "17686166" "18187045" "18652134" "19093281" ...
$ Population, male (% of total population) : chr [1:1320] "51.39295063" "51.40033239" "51.38838428" "51.3641533" ...
$ Population, total : chr [1:1320] "34413603" "35383128" "36296400" "37172386" ...
$ Rural population : chr [1:1320] "25877997" "26530269" "27131559" "27695286" ...
$ Rural population (% of total population) : chr [1:1320] "75.197" "74.98" "74.75" "74.505" ...
$ Rural population growth (annual %) : chr [1:1320] "2.79024919" "2.48932366" "2.241128152" "2.056462912" ...
$ Urban population : chr [1:1320] "8535606" "8852859" "9164841" "9477100" ...
$ Urban population (% of total population) : chr [1:1320] "24.803" "25.02" "25.25" "25.495" ...
$ Urban population growth (annual %) : chr [1:1320] "3.951762191" "3.649410221" "3.463407483" "3.350382976" ...
$ Access to clean fuels and technologies for cooking (% of population): chr [1:1320] "30.1" "32.44" ".." ".." ...
$ Access to electricity (% of population) : chr [1:1320] "71.5" "97.7" "97.7" "98.71320343" ...
$ Access to electricity, rural (% of rural population) : chr [1:1320] "64.57335394" "97.0993598" "97.09197324" "98.27287218" ...
$ Access to electricity, urban (% of urban population) : chr [1:1320] "92.5" "99.5" "99.5" "100" ...
$ Coverage of social protection and labor programs (% of population) : chr [1:1320] ".." ".." ".." ".." ...
$ Individuals using the Internet (% of population) : chr [1:1320] "8.26" ".." "11.44768809" ".." ...
We can see that the data frame pop_dev has 19 variables and 1320 observations. The only variable that has the correct data type is “Country Name” (character). The “Year” variable should be a factor and everything else should be numeric.
#Converting the "Year" variable from character to factor and ordering
pop_dev$Year <- pop_dev$Year %>% factor(levels = c("2015 [YR2015]", "2016 [YR2016]", "2017 [YR2017]", "2018 [YR2018]", "2019 [YR2019]"), ordered = TRUE)
levels(pop_dev$Year)
[1] "2015 [YR2015]" "2016 [YR2016]" "2017 [YR2017]"
[4] "2018 [YR2018]" "2019 [YR2019]"
Whilst the data set contains 19 variables I have opted to select the ones most relevant to this analysis. The remaining indicator variables are then converted to the numeric class.
#Relevant indicators selected
pop_dev2 <- pop_dev %>% select(c(1, 2, 7, 15))
#The remaining indicator variables are then converted to the numeric class
pop_dev2$`Population, total` <- as.numeric(pop_dev2$`Population, total`)
NAs introduced by coercion
pop_dev2$`Access to electricity (% of population)` <- as.numeric(pop_dev2$`Access to electricity (% of population)`)
NAs introduced by coercion
#Summary of the variables after all conversions are finished
str(pop_dev2)
tibble [1,320 × 4] (S3: tbl_df/tbl/data.frame)
$ Country Name : chr [1:1320] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ Year : Ord.factor w/ 5 levels "2015 [YR2015]"<..: 1 2 3 4 5 1 2 3 4 5 ...
$ Population, total : num [1:1320] 34413603 35383128 36296400 37172386 38041754 ...
$ Access to electricity (% of population): num [1:1320] 71.5 97.7 97.7 98.7 NA ...
Variable Descriptions We can now see that the data frame has 4 remaining variables:
In this section the data will be mutated to find the total number of people with access to electricity by country in the year 2018. This number will be derived from multiplying the percentage of population by the total population.
#Select data from the year 2018
popdev_2018 <- subset(pop_dev2, Year == '2018 [YR2018]') %>% select(c(1, 3, 4))
#Finding total number of people with access to electricity
popdev_2018 <- mutate(popdev_2018, "# of ppl with access to elec" = (`Access to electricity (% of population)` / 100) * `Population, total` )
str(popdev_2018)
tibble [264 × 4] (S3: tbl_df/tbl/data.frame)
$ Country Name : chr [1:264] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
$ Population, total : num [1:264] 37172386 2866376 42228429 55465 77006 ...
$ Access to electricity (% of population): num [1:264] 98.7 100 100 NA 100 ...
$ # of ppl with access to elec : num [1:264] 36694053 2866376 42228429 NA 77006 ...
head(popdev_2018)
NA
Now we scan the data for missing values, special values and obvious errors (i.e. inconsistencies).
#Scan for NaN values
sapply(popdev_2018, function(x) sum(is.na(x)))
Country Name
0
Population, total
2
Access to electricity (% of population)
3
# of ppl with access to elec
4
#Scan for infinite values
sapply(popdev_2018, function(x) sum(is.infinite(x)))
Country Name
0
Population, total
0
Access to electricity (% of population)
0
# of ppl with access to elec
0
#Scan for NA values
colSums(is.na(popdev_2018))
Country Name
0
Population, total
2
Access to electricity (% of population)
3
# of ppl with access to elec
4
sum(is.na(popdev_2018))
[1] 9
There are no infinite values, but a few NA and NaN values. There are the same amount of NA values as there are NaN values. As all indicator variables are numeric, every missing value on the document will be treated as an NA value.
In order to reduce errors in the mutated variables we must look at its constituents. We can see from the scan that there are a combined 5 errors for the two original variables.
#Find which countries and years are missing population data
popdev_2018[which(is.na(popdev_2018$'Population, total')), "Country Name"]
#Find which countries and years are missing 'Access to electricity' data
popdev_2018[which(is.na(popdev_2018$'Access to electricity (% of population)')), "Country Name"]
NA
NA
An obvious error can be seen in the country name ‘Not classified’, there appears to be no data for this at all.
The observation Eritrea is missing its ‘Population, total’ data. This is not present in the original data set where no values were recorded past 2011. Luckily there is population data for Eritrea in the year 2018 available online (United Nations, 2019).
#Input the united nations Eritrea population data to replace NA value
popdev_2018[popdev_2018$`Country Name` == "Eritrea", "Population, total"] <- 3452786
#Re-run prev code to fill out Eritrea row
popdev_2018 <- mutate(popdev_2018, "# of ppl with access to elec" = (`Access to electricity (% of population)` / 100) * `Population, total` )
The remaining countries American Samoa and the Channel Islands have NA values in the ‘Access to electricity’ variable. Unfortunately, this information is not as readily available and I have opted to omit. It would not make sense to impute a measure of centre into the data as too many factors go into determining a country’s ‘Access to electricity’ percentage. This is a variable that cannot be generalised.
#Remove all remaining observations with NA values
popdev_2018 <- na.omit(popdev_2018)
#Remaining missing values
colSums(is.na(popdev_2018))
Country Name
0
Population, total
0
Access to electricity (% of population)
0
# of ppl with access to elec
0
sum(is.na(popdev_2018))
[1] 0
In order to determine what techniques we use to address outliers we must first determine the type of distribution of each variable. This can be done using histograms.
#Histogram of total population
popdev_2018$`Population, total` %>% hist(xlab="Population, total", main="Total Population Distribution", breaks = 70)
#Histogram of electricity access
popdev_2018$`Access to electricity (% of population)` %>% hist(xlab="Access to electricity (% of pop)", main="Electricity Access Distribution", breaks = 70)
#Histogram of total number of people with electricity
popdev_2018$`# of ppl with access to elec` %>% hist(xlab="# of ppl with access to elec", main="Number of people with access to electricity", breaks = 70)
From the histograms we can see that all three data sets are heavily skewed. This suggests the best way to detect outliers will be through ‘Turkey’s method of outlier detection’.
#Plots variables using box plots
popdev_2018$`Population, total` %>% boxplot(ylab="Population, total")
popdev_2018$`Access to electricity (% of population)` %>% boxplot(ylab="Access to electricity (% of pop)")
popdev_2018$`# of ppl with access to elec` %>% boxplot(ylab="# of ppl with access to elec")
NA
From the box plots we can see that each variable has multiple outliers. Capping/Winsorising will be used to adjust the data set. Classical statistics (mean, median etc) are sensitive to extreme values and this is a symmetric process that will help make the statistics more ‘robust’.
#Function for capping outliers
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
#Apply capping function to each variable and re-print boxplot
pd_capped <- popdev_2018$`Population, total` %>% cap()
boxplot(pd_capped, ylab="Population, total")
elec_capped <- popdev_2018$`Access to electricity (% of population)` %>% cap()
boxplot(elec_capped, ylab="Access to electricity (% of pop)")
totalnum_capped <- popdev_2018$`# of ppl with access to elec` %>% cap()
boxplot(totalnum_capped, ylab="# of ppl with access to elec")
From the histograms in section ‘Scan II’ we can see that the data set for the total population is right skewed. The data will be transformed using a log function to decrease skewness and convert the distribution into a normal distribution.
#log function plotted
pop_log <- popdev_2018$`Population, total` %>% log()
hist(pop_log)
The data now follows a normal distribution much more closely.
The World Bank 2019, Health Nutrition and Population Statistics, Data File, World Bank Group, viewed 18 October 2020, https://databank.worldbank.org/source/health-nutrition-and-population-statistics.
The World Bank 2019, World Development Indicators, Data File, World Bank Group, viewed 18 October 2020, https://databank.worldbank.org/source/world-development-indicators.
United Nations 2019, World Population Prospects, Data File, United Nations, viewed 18 October 2020, https://population.un.org/wpp/Download/.