library(readr)
## Warning: package 'readr' was built under R version 3.6.1
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.6.1
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.6.1
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(forecast)
## Warning: package 'forecast' was built under R version 3.6.1
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Registered S3 methods overwritten by 'forecast':
## method from
## fitted.fracdiff fracdiff
## residuals.fracdiff fracdiff
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.6.1
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(outliers)
library(ggplot2)
setwd("~/R")# set working directory
To begin, we need to install and load some prerequisite R packages into our workspace. For this assignment, we loaded the readr, dplyr, tidyr, lubridate, forecast, Hmisc, lattice, survival, Formula, ggplot2 and outliers packages. These packages enable us to carry out the various data pre-processing tasks needed to reach our project conclusions.
We began by selecting two datasets, each with 263 observations and 16 variables, reading them into R and then joining them. We filtered out the non-contributing variables, tidied the data and created an additional variable. We then scanned the data for missing values, inconsistencies and obvious errors. We counted the missing values in every variable and summed these counts to get a total of 388 missing values in the dataset, which we replaced with the mean of the respective variables. Next, we scanned the data for outliers. We used a distance-based method to detect the univariate outliers, namely the z-scores (normal scores) method, and substituted the outliers with the mean of the respective variables. Transforming variables can also mitigate outliers (for example, taking the natural logarithm reduces the variation caused by extreme values, and binning is another form of variable transformation), but since only 4 outliers were found, no transformation-based removal was applied at that stage; they were replaced and then normalised in the following step. Finally, we applied an appropriate transformation to one variable: we plotted its histogram, observed a leftward skew and rectified it. The following are the variables in the dataset:
- Series Name: name of the data series (character)
- Series Code: code of the series (character)
- Country Name: name of the country (character)
- Country Code: code of the country (character)
- 1990 [YR1990], 2000 [YR2000], 2009 [YR2009], ..., 2018 [YR2018]: urban population percentage in each year (character)
We are working with two datasets, urban_females and urban_males, sourced from https://databank.worldbank.org/reports.aspx?source=283&series=SP.URB.TOTL.FE.ZS and https://databank.worldbank.org/reports.aspx?source=283&series=SP.URB.TOTL.MA.ZS respectively.
urban_female<-read_csv("urban_females.csv")
## Parsed with column specification:
## cols(
## `Series Name` = col_character(),
## `Series Code` = col_character(),
## `Country Name` = col_character(),
## `Country Code` = col_character(),
## `1990 [YR1990]` = col_character(),
## `2000 [YR2000]` = col_character(),
## `2009 [YR2009]` = col_character(),
## `2010 [YR2010]` = col_character(),
## `2011 [YR2011]` = col_character(),
## `2012 [YR2012]` = col_character(),
## `2013 [YR2013]` = col_character(),
## `2014 [YR2014]` = col_character(),
## `2015 [YR2015]` = col_character(),
## `2016 [YR2016]` = col_character(),
## `2017 [YR2017]` = col_character(),
## `2018 [YR2018]` = col_character()
## )
urban_male<-read_csv("urban_males.csv")
## Parsed with column specification:
## cols(
## `Series Name` = col_character(),
## `Series Code` = col_character(),
## `Country Name` = col_character(),
## `Country Code` = col_character(),
## `1990 [YR1990]` = col_character(),
## `2000 [YR2000]` = col_character(),
## `2009 [YR2009]` = col_character(),
## `2010 [YR2010]` = col_character(),
## `2011 [YR2011]` = col_character(),
## `2012 [YR2012]` = col_character(),
## `2013 [YR2013]` = col_character(),
## `2014 [YR2014]` = col_character(),
## `2015 [YR2015]` = col_character(),
## `2016 [YR2016]` = col_character(),
## `2017 [YR2017]` = col_character(),
## `2018 [YR2018]` = col_character()
## )
head(urban_female)
View(urban_female)
head(urban_male)
View(urban_male)
urban_population<- full_join(urban_male,urban_female)
## Joining, by = c("Series Name", "Series Code", "Country Name", "Country Code", "1990 [YR1990]", "2000 [YR2000]", "2009 [YR2009]", "2010 [YR2010]", "2011 [YR2011]", "2012 [YR2012]", "2013 [YR2013]", "2014 [YR2014]", "2015 [YR2015]", "2016 [YR2016]", "2017 [YR2017]", "2018 [YR2018]")
View(urban_population)
urban_population <- select(urban_population, -`2009 [YR2009]`, -(`2011 [YR2011]`:`2014 [YR2014]`), -(`2016 [YR2016]`:`2018 [YR2018]`))
head(urban_population)
The urban_females dataset gives the female urban population of 263 different countries and unions for various years from 1990 to 2018, and the urban_males dataset gives the corresponding male urban population for the same countries and years. The read_csv function was run to import both datasets into the workspace, labelling them urban_female and urban_male. The head function shows the first 6 observations by default, and View lets us inspect the datasets. We then applied full_join to urban_male and urban_female and labelled the resulting table urban_population. Next we removed the non-contributing variables, which would not contribute towards our results and conclusions, by using select to keep all variables except the ones we do not require. This gives us the dataset we can work with.
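The full join above relies on dplyr detecting the shared columns automatically (as reported in the joining message). For readability, the same join could be written with an explicit by argument listing those key columns; this is only an equivalent sketch, not a change to the analysis.
# Equivalent full join with the key columns spelled out explicitly
# (the same columns that full_join() detected automatically above)
urban_population <- full_join(
  urban_male, urban_female,
  by = c("Series Name", "Series Code", "Country Name", "Country Code",
         "1990 [YR1990]", "2000 [YR2000]", "2009 [YR2009]", "2010 [YR2010]",
         "2011 [YR2011]", "2012 [YR2012]", "2013 [YR2013]", "2014 [YR2014]",
         "2015 [YR2015]", "2016 [YR2016]", "2017 [YR2017]", "2018 [YR2018]")
)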
str(urban_population)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 526 obs. of 8 variables:
## $ Series Name : chr "Urban population, male (% of total)" "Urban population, male (% of total)" "Urban population, male (% of total)" "Urban population, male (% of total)" ...
## $ Series Code : chr "SP.URB.TOTL.MA.ZS" "SP.URB.TOTL.MA.ZS" "SP.URB.TOTL.MA.ZS" "SP.URB.TOTL.MA.ZS" ...
## $ Country Name : chr "Arab World" "Caribbean small states" "Central Europe and the Baltics" "Early-demographic dividend" ...
## $ Country Code : chr "ARB" "CSS" "CEB" "EAR" ...
## $ 1990 [YR1990]: chr "26.01687791" "18.83730963" "29.62850248" "17.8551892" ...
## $ 2000 [YR2000]: chr "27.31776013" "20.04045222" "29.35812377" "19.49708011" ...
## $ 2010 [YR2010]: chr "29.44776978" "20.30356413" "29.49410413" "21.3755929" ...
## $ 2015 [YR2015]: chr "30.49376447" "20.49296219" "29.64049012" "22.44523102" ...
urban_population <- gather(urban_population, `1990 [YR1990]`:`2015 [YR2015]`, key = 'years', value = 'popultaion percentage')
head(urban_population)
urban_population$years <- factor(urban_population$years, levels = c('1990 [YR1990]', '2000 [YR2000]', '2010 [YR2010]', '2015 [YR2015]'), labels = c(1990, 2000, 2010, 2015), ordered = TRUE)
head(urban_population$years)
## [1] 1990 1990 1990 1990 1990 1990
## Levels: 1990 < 2000 < 2010 < 2015
urban_population$`Series Name`<-as.factor(urban_population$`Series Name`)
levels(urban_population$`Series Name`)
## [1] "Urban population, female (% of total)"
## [2] "Urban population, male (% of total)"
class(urban_population$`Series Name`)
## [1] "factor"
urban_population$`Series Code`<-as.factor(urban_population$`Series Code`)
options(digits = 15) # maintain the number of digits after the decimal place
urban_population$`popultaion percentage`<-as.double(urban_population$`popultaion percentage`)
## Warning: NAs introduced by coercion
str(urban_population)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2104 obs. of 6 variables:
## $ Series Name : Factor w/ 2 levels "Urban population, female (% of total)",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Series Code : Factor w/ 2 levels "SP.URB.TOTL.FE.ZS",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Country Name : chr "Arab World" "Caribbean small states" "Central Europe and the Baltics" "Early-demographic dividend" ...
## $ Country Code : chr "ARB" "CSS" "CEB" "EAR" ...
## $ years : Ord.factor w/ 4 levels "1990"<"2000"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ popultaion percentage: num 26 18.8 29.6 17.9 17.3 ...
To check the attributes in the data and apply proper data type conversions, we first applied the str function to our dataset and noticed that all variables are of character type. Our data was also in wide format, so we used the gather function to convert it into long format, moving all the year columns into a single years column; as the data was untidy, this also completes tidy task 1 here. Returning to the character data types, we applied several conversion functions: factor (with the levels, labels and ordered arguments) on the variable years, as.factor on the variables Series Name and Series Code, and as.double on the population percentage column, so that each variable is read and interpreted correctly by the software. When converting the population percentage to a numeric type, the many missing values that were recorded as ".." were coerced to NA. Verifying the dataset again with str shows the correct data type for each variable.
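As an alternative (not used above), the ".." placeholders could have been declared as missing-value markers at import time via read_csv's na argument, so the year columns would arrive as numeric with NA already in place; this is a sketch only, assuming ".." is the only placeholder used in the files.
# Sketch: read ".." as NA at import time instead of coercing with as.double() later
urban_female <- read_csv("urban_females.csv", na = c("", "NA", ".."))
urban_male <- read_csv("urban_males.csv", na = c("", "NA", ".."))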
As our dataset was not workable in its original form, we performed tidy and manipulate task 1 in the earlier step.
#Performed Earlier
urban_population <- mutate(urban_population, remaining_population_percent = 100.00 - `popultaion percentage`)
head(urban_population[5:7])
For tidy and manipulate task 2, we applied the mutate function and created a new variable, remaining_population_percent, computed as 100 minus the population percentage in urban_population.
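As an optional sanity check (not part of the assignment tasks), the new variable can be verified by confirming that the two percentage columns always sum to 100 before any imputation:
# Optional check: the two percentage columns should sum to 100 (NAs remain NA)
summary(urban_population$`popultaion percentage` + urban_population$remaining_population_percent)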
We are now required to scan the data for missing values, inconsistencies and obvious errors.
colSums(is.na(urban_population))
## Series Name Series Code
## 0 0
## Country Name Country Code
## 0 0
## years popultaion percentage
## 0 194
## remaining_population_percent
## 194
sum(is.na(urban_population))
## [1] 388
urban_population$`popultaion percentage`<-impute(urban_population$`popultaion percentage`,fun = mean)
urban_population$remaining_population_percent<-impute(urban_population$remaining_population_percent ,fun = mean)
colSums(is.na(urban_population))
## Series Name Series Code
## 0 0
## Country Name Country Code
## 0 0
## years popultaion percentage
## 0 0
## remaining_population_percent
## 0
sum(is.na(urban_population))
## [1] 0
Therefore, in this step, we used the is.na function to count the missing values in every variable: there were 194 each in the columns population percentage and remaining_population_percent. Summing these counts gives a total of 388 missing values in the data. We then applied the impute function to both columns to replace the missing values with the mean of the respective variable. After impute had run successfully, we verified the data again with is.na and found no remaining missing values.
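For reference, Hmisc's impute(x, fun = mean) performs a simple mean replacement here (while also tagging which values were imputed); an equivalent base-R sketch for one column, shown on a copy so it does not alter the dataset, would be:
# Base-R sketch of mean imputation (the report itself uses Hmisc::impute)
x <- urban_population$`popultaion percentage`
x[is.na(x)] <- mean(x, na.rm = TRUE)   # replace NAs with the column mean
sum(is.na(x))                          # should now be 0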
We are now required to scan the numeric data for outliers.
z.scores <- urban_population$`popultaion percentage` %>% scores(type = "z")
z.scores %>% summary()
##
## 194 values imputed to 0
## Min. 1st Qu. Median Mean
## -2.182961611610 -0.784023625989 0.000000000000 0.000000000000
## 3rd Qu. Max.
## 0.736265261731 4.450082490995
which( abs(z.scores) >3 )
## [1] 206 732 1258 1784
length (which( abs(z.scores) >3 ))
## [1] 4
urban_population$`popultaion percentage`[ which( abs(z.scores) >3 )] <- mean(urban_population$`popultaion percentage`, na.rm = TRUE)
z.scores <- urban_population$remaining_population_percent %>% scores(type = "z")
z.scores %>% summary()
##
## 194 values imputed to 0
## Min. 1st Qu. Median Mean
## -4.450082490995 -0.736265261731 0.000000000000 0.000000000000
## 3rd Qu. Max.
## 0.784023625989 2.182961611610
which( abs(z.scores) >3 )
## [1] 206 732 1258 1784
length (which( abs(z.scores) >3 ))
## [1] 4
urban_population$remaining_population_percent[which(abs(z.scores)>3)]<-mean(urban_population$remaining_population_percent, na.rm = TRUE)
Therefore, in this step, we used a distance-based method to detect univariate outliers, namely the z-scores (normal scores) method. A standardised score (z-score) was calculated for every observation of the variable population percentage using the code z.scores <- urban_population$`popultaion percentage` %>% scores(type = "z"). The minimum value was -2.18 and the maximum 4.45. We then applied which(abs(z.scores) > 3) to identify where the outliers are, and length(which(abs(z.scores) > 3)) to count how many outliers the variable contains.
Next, the same standardised score (z-score) was calculated for every observation of the variable remaining_population_percent, this time using the code z.scores <- urban_population$remaining_population_percent %>% scores(type = "z"). The minimum value was -4.45 and the maximum 2.18. We again applied which(abs(z.scores) > 3) to identify where the outliers are, and length(which(abs(z.scores) > 3)) to count how many outliers the variable contains.
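For completeness, scores(type = "z") from the outliers package standardises each value as (x - mean) / sd, so the same result can be reproduced manually; a small illustrative sketch for the population percentage column:
# Manual z-score sketch, equivalent to outliers::scores(x, type = "z")
x <- urban_population$`popultaion percentage`
z_manual <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
length(which(abs(z_manual) > 3))   # count of observations with |z| > 3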
We substituted the outliers with the mean value of the respective variable. The code used for this task was urban_population$`popultaion percentage`[which(abs(z.scores) > 3)] <- mean(urban_population$`popultaion percentage`, na.rm = TRUE) and urban_population$remaining_population_percent[which(abs(z.scores) > 3)] <- mean(urban_population$remaining_population_percent, na.rm = TRUE) for the respective variables.
Transforming variables can also mitigate outliers: taking the natural logarithm of a value reduces the variation caused by extreme values, and binning is another form of variable transformation. However, because only 4 outliers were detected in each variable, no transformation-based removal was applied here; the outliers were simply replaced with the mean, and the variable is normalised by the transformation in the next step.
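Purely for illustration, and not applied in this report, the two alternatives mentioned above (a natural-log transform and binning) could be sketched as follows:
# Illustrative sketches of the alternatives mentioned above (not applied here)
log_percent <- log(urban_population$`popultaion percentage`)                  # natural-log transform
binned_percent <- cut(urban_population$`popultaion percentage`, breaks = 5)   # equal-width bins
table(binned_percent)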
We are now required to apply an appropriate transformation for at least one of the variables.
hist(urban_population$remaining_population_percent)
boxcox_transform<- BoxCox(urban_population$remaining_population_percent,lambda = "auto")
hist(boxcox_transform)
In this step, we plot a histogram of the variable remaining_population_percent and observe a leftward skew. We therefore apply the BoxCox function with the argument lambda = "auto" to rectify it, and plot a histogram of the transformed values to confirm.
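If needed, a λ value can also be chosen explicitly with BoxCox.lambda from the forecast package and passed to BoxCox, and InvBoxCox reverses the transform; a short sketch (the exact λ depends on the data):
# Inspect an explicitly chosen lambda and check that the transform is reversible
lambda <- BoxCox.lambda(urban_population$remaining_population_percent)
lambda
boxcox_transform <- BoxCox(urban_population$remaining_population_percent, lambda = lambda)
head(InvBoxCox(boxcox_transform, lambda = lambda))   # recovers the original values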