Packages Used:

1. readr - Used to read the data from the CSV file into R.
2. readxl - Used to read the data from the Excel file into R.
3. dplyr - Used for fast data manipulation; it provides functions to filter, summarise and rearrange data.
4. tidyr - Used to tidy the dataset; it can change the layout of the dataset between wide and long formats.
5. Hmisc - Used to handle NA and missing values (imputation).
6. forecast - Used for data transformation; it provides the BoxCox function.
7. outliers - Used to detect and handle outliers.
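All of these packages are available on CRAN, so a one-off install (a sketch, only needed if any of them are missing) is:

install.packages(c("readr", "readxl", "dplyr", "tidyr", "Hmisc", "forecast", "outliers"))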

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
#install.packages('forecast')
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
#install.packages('Hmisc')
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
#install.packages('outliers')
library(outliers)

Executive Summary

We have selected two datasets. The first describes life expectancy and is already tidy: it follows the principles of tidy data, with one variable per column and one observation per row. The second describes the percentage of employers in the job markets of various countries; it is untidy because the values for each year are spread across separate columns, i.e. the data is in wide format. Both datasets contain a column identifying the country, so after standardising that column name to "country" the two tables can be merged with a join. The gather function was used to convert the wide-format jobs data into tidy long format, and the datasets were then merged on country and year. We selected only the required columns, replaced the ".." placeholders with NA, imputed the missing values, and corrected the columns that had incorrect datatypes. With this pre-processing complete, we can move ahead with the assignment.

Data

We have selected two datasets for this project. One is tidy and the other is untidy. The tidy dataset is about life expectancy. Its attributes are as follows:

1. Country - Name of the country
2. Year - Year of observation
3. Status - Developed or developing country
4. Life expectancy - Average life expectancy (age)
5. Adult Mortality - Adult mortality rate per 1000 people
6. infant deaths - Number of infant deaths per 1000 people
7. Alcohol - Alcohol consumption (litres)
8. percentage expenditure - Spending on the health sector
9. Hepatitis B - Immunity against hepatitis B in children under 1 year of age
10. Measles - Measles cases per 1000 people
11. BMI - Average body mass index
12. Polio - Number of polio cases per 1000 people
13. Total expenditure - Total yearly expenditure
14. Diphtheria - Number of diphtheria cases per 1000 people
15. HIV/AIDS - Number of HIV/AIDS cases per 1000 people
16. GDP - GDP of the country
17. Population - Population of the country

Only Country and Status have character as their datatype; the remaining columns are of type double.

The second dataset indicates the percentage of employers in the global job market. It is an untidy dataset, taken from www.worlddatabank.org. Its attributes are as follows:

1. Series Name - Factor (Employers, total (% of total employment) (modeled ILO estimate))
2. Series Code - Code for that particular series
3. Country Name - Name of the country
4. Country Code - Country code
5. 2007 [YR2007] to 2016 [YR2016] - One column per year holding the value for that year

library(readxl)
jobs <- read_excel("~/Master of Data Science (RMIT)/Semester 2/Data Wrangling/Assignments/Assignment 2/Assignment 2_2/Data_Extract_From_Jobs.xlsx")
#View(jobs)
head(jobs)
life_expectancy <- read_csv("~/Master of Data Science (RMIT)/Semester 2/Data Wrangling/Assignments/Assignment 2/Assignment 2_2/Life Expectancy Data.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Country = col_character(),
##   Status = col_character()
## )
## See spec(...) for full column specifications.
#View(life_expectancy)
head(life_expectancy)

Filtering the datasets to select the required columns

jobs_filtered <- jobs[,c(3,5:14)]
life_expectancy_filtered <- life_expectancy[,c(1:4,6,18)]
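#Equivalent selection by column name using dplyr::select (a sketch; the names
#used here are the original column headers listed in the Data section above):
jobs_filtered <- select(jobs, `Country Name`, `2007 [YR2007]`:`2016 [YR2016]`)
life_expectancy_filtered <- select(life_expectancy, Country, Year, Status,
                                   `Life expectancy`, `infant deaths`, Population)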

Renaming the columns

jobs_filtered <- rename(jobs_filtered,"country"="Country Name")
jobs_filtered <- rename(jobs_filtered,"2007"="2007 [YR2007]")
jobs_filtered <- rename(jobs_filtered,"2008"="2008 [YR2008]")
jobs_filtered <- rename(jobs_filtered,"2009"="2009 [YR2009]")
jobs_filtered <- rename(jobs_filtered,"2010"="2010 [YR2010]")
jobs_filtered <- rename(jobs_filtered,"2011"="2011 [YR2011]")
jobs_filtered <- rename(jobs_filtered,"2012"="2012 [YR2012]")
jobs_filtered <- rename(jobs_filtered,"2013"="2013 [YR2013]")
jobs_filtered <- rename(jobs_filtered,"2014"="2014 [YR2014]")
jobs_filtered <- rename(jobs_filtered,"2015"="2015 [YR2015]")
jobs_filtered <- rename(jobs_filtered,"2016"="2016 [YR2016]")
head(jobs_filtered)
life_expectancy_filtered <- rename(life_expectancy_filtered,"country"="Country")
life_expectancy_filtered <- rename(life_expectancy_filtered,"year"="Year")
life_expectancy_filtered <- rename(life_expectancy_filtered,"status"="Status")
life_expectancy_filtered <- rename(life_expectancy_filtered,"life_expectancy"="Life expectancy")
life_expectancy_filtered <- rename(life_expectancy_filtered,"infant_deaths"="infant deaths")
life_expectancy_filtered <- rename(life_expectancy_filtered,"population"="Population")
head(life_expectancy_filtered)

Replacing the “..” values with NA

jobs_filtered$`2007`[jobs_filtered$`2007` == ".." ] <- NA
jobs_filtered$`2008`[jobs_filtered$`2008` == ".." ] <- NA
jobs_filtered$`2009`[jobs_filtered$`2009` == ".." ] <- NA
jobs_filtered$`2010`[jobs_filtered$`2010` == ".." ] <- NA
jobs_filtered$`2011`[jobs_filtered$`2011` == ".." ] <- NA
jobs_filtered$`2012`[jobs_filtered$`2012` == ".." ] <- NA
jobs_filtered$`2013`[jobs_filtered$`2013` == ".." ] <- NA
jobs_filtered$`2014`[jobs_filtered$`2014` == ".." ] <- NA
jobs_filtered$`2015`[jobs_filtered$`2015` == ".." ] <- NA  
jobs_filtered$`2016`[jobs_filtered$`2016` == ".." ] <- NA
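#The same replacement in a single pass (a sketch): apply replace(x, x == "..", NA)
#to each of the renamed year columns.
year_cols <- as.character(2007:2016)
jobs_filtered[year_cols] <- lapply(jobs_filtered[year_cols], function(x) replace(x, x == "..", NA))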

Understand

We have used the summary function to understand the central tendencies of all numeric columns in the data. We then check the datatype of every column and convert the ones that need it: in the jobs dataset the year columns are stored as character and are converted to numeric, and in the life expectancy dataset the year column is converted to a factor.
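As a quick check, all of the column types can be inspected in one call (a sketch; it returns the same information as the individual typeof() calls further below):

sapply(jobs_filtered, typeof)
sapply(life_expectancy_filtered, typeof)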

summary(jobs_filtered)
##    country              2007               2008               2009          
##  Length:242         Length:242         Length:242         Length:242        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      2010               2011               2012               2013          
##  Length:242         Length:242         Length:242         Length:242        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##      2014               2015               2016          
##  Length:242         Length:242         Length:242        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character
summary(life_expectancy_filtered)
##    country               year         status          life_expectancy
##  Length:2938        Min.   :2000   Length:2938        Min.   :36.30  
##  Class :character   1st Qu.:2004   Class :character   1st Qu.:63.10  
##  Mode  :character   Median :2008   Mode  :character   Median :72.10  
##                     Mean   :2008                      Mean   :69.22  
##                     3rd Qu.:2012                      3rd Qu.:75.70  
##                     Max.   :2015                      Max.   :89.00  
##                                                       NA's   :10     
##  infant_deaths      population       
##  Min.   :   0.0   Min.   :3.400e+01  
##  1st Qu.:   0.0   1st Qu.:1.958e+05  
##  Median :   3.0   Median :1.387e+06  
##  Mean   :  30.3   Mean   :1.275e+07  
##  3rd Qu.:  22.0   3rd Qu.:7.420e+06  
##  Max.   :1800.0   Max.   :1.294e+09  
##                   NA's   :652
typeof(life_expectancy_filtered$country)
## [1] "character"
typeof(life_expectancy_filtered$year)
## [1] "double"
typeof(life_expectancy_filtered$status)
## [1] "character"
typeof(life_expectancy_filtered$life_expectancy)
## [1] "double"
typeof(life_expectancy_filtered$infant_deaths)
## [1] "double"
typeof(life_expectancy_filtered$population)
## [1] "double"
typeof(jobs_filtered$country)
## [1] "character"
typeof(jobs_filtered$`2007`)
## [1] "character"
typeof(jobs_filtered$`2008`)
## [1] "character"
typeof(jobs_filtered$`2009`)
## [1] "character"
typeof(jobs_filtered$`2010`)
## [1] "character"
typeof(jobs_filtered$`2011`)
## [1] "character"
typeof(jobs_filtered$`2012`)
## [1] "character"
typeof(jobs_filtered$`2014`)
## [1] "character"
typeof(jobs_filtered$`2015`)
## [1] "character"
typeof(jobs_filtered$`2016`)
## [1] "character"

Converting the datatype from character to double

jobs_filtered$`2007`<-as.double(jobs_filtered$`2007`)
jobs_filtered$`2008`<-as.double(jobs_filtered$`2008`)
jobs_filtered$`2009`<-as.double(jobs_filtered$`2009`)
jobs_filtered$`2010`<-as.double(jobs_filtered$`2010`)
jobs_filtered$`2011`<-as.double(jobs_filtered$`2011`)
jobs_filtered$`2012`<-as.double(jobs_filtered$`2012`)
jobs_filtered$`2013`<-as.double(jobs_filtered$`2013`)
jobs_filtered$`2014`<-as.double(jobs_filtered$`2014`)
jobs_filtered$`2015`<-as.double(jobs_filtered$`2015`)
jobs_filtered$`2016`<-as.double(jobs_filtered$`2016`)
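#A scoped-verb alternative to the ten conversions above (a sketch; mutate_at()
#is superseded in newer versions of dplyr but gives the same result here):
jobs_filtered <- mutate_at(jobs_filtered, vars(`2007`:`2016`), as.double)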
life_expectancy_filtered$year <- as.factor(life_expectancy_filtered$year)
typeof(jobs_filtered$`2007`)
## [1] "double"
typeof(jobs_filtered$`2008`)
## [1] "double"
typeof(jobs_filtered$`2009`)
## [1] "double"
typeof(jobs_filtered$`2010`)
## [1] "double"
typeof(jobs_filtered$`2011`)
## [1] "double"
typeof(jobs_filtered$`2012`)
## [1] "double"
typeof(jobs_filtered$`2014`)
## [1] "double"
typeof(jobs_filtered$`2015`)
## [1] "double"
typeof(jobs_filtered$`2016`)
## [1] "double"
class(life_expectancy_filtered$year)
## [1] "factor"

Tidy and Manipulate 1

The jobs data is in an untidy format, so we need to tidy it before performing any operations. It has a separate column for each year, i.e. it is in wide format. We convert it to long (tidy) format using the gather function, with the key set to "year" and the value set to "percent_employers". After this operation the data is tidy. We then use a left join to merge the two datasets on "country" and "year", since both of these columns are common to the two datasets.

jobs_filtered2 <- jobs_filtered  %>% gather(key="year", value="percent_employers", c(`2007`:`2016`))
head(jobs_filtered2)
typeof(life_expectancy_filtered)
## [1] "list"
integrated_data <- left_join(jobs_filtered2, life_expectancy_filtered,
                             by = c("country" = "country", "year" = "year"))
head(integrated_data)
integrated_data <- integrated_data %>% drop_na(`country`)

Tidy and Manipulate 2

We used the mutate function to calculate a per-capita rate of employers by dividing the percentage of employers in each country by the country's population and multiplying by 100. The result is saved in the new column "rate_employer_percentage", and its summary statistics are reported below.

integrated_data <- mutate(integrated_data, rate_employer_percentage = (percent_employers/population)*100)
summary(integrated_data$rate_employer_percentage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0002  0.0297  0.0011 10.0415    1164

Scan 1

We checked the data for missing values using the is.na() function, and colSums() was used to count the missing values in every column in one line of code. The impute() function from Hmisc was then used to fill the missing values: "percent_employers", "life_expectancy", "infant_deaths", "population" and "rate_employer_percentage" were imputed with the mean because they are numeric, while the categorical "status" column was imputed with its mode (most frequent value). Imputing changes the class of a column to "impute", so we converted the classes back to their original types for the plots in the later sections. We also confirmed that none of the columns contain infinite values.

colSums(is.na(integrated_data))
##                  country                     year        percent_employers 
##                        0                        0                      306 
##                   status          life_expectancy            infant_deaths 
##                     1000                     1007                     1000 
##               population rate_employer_percentage 
##                     1144                     1164
integrated_data$percent_employers <- impute(integrated_data$percent_employers,fun = mean)
class(integrated_data$percent_employers)
## [1] "impute"
integrated_data$life_expectancy <- impute(integrated_data$life_expectancy,fun = mean)
integrated_data$infant_deaths <- impute(integrated_data$infant_deaths,fun = mean)
integrated_data$population <- impute(integrated_data$population,fun = mean)
integrated_data$rate_employer_percentage <- impute(integrated_data$rate_employer_percentage,fun = mean)
#base R's mode() returns the storage mode, so impute status with its most frequent value instead
status_mode <- names(which.max(table(integrated_data$status)))
integrated_data$status <- impute(integrated_data$status, fun = status_mode)
integrated_data$status <- as.character(integrated_data$status)
integrated_data$percent_employers <- as.numeric(as.character(integrated_data$percent_employers))
integrated_data$life_expectancy <- as.numeric(as.character(integrated_data$life_expectancy))
integrated_data$infant_deaths <- as.numeric(as.character(integrated_data$infant_deaths))
integrated_data$population <- as.numeric(as.character(integrated_data$population))
integrated_data$rate_employer_percentage <- as.numeric(as.character(integrated_data$rate_employer_percentage))
class(integrated_data$status)
## [1] "character"
class(integrated_data$percent_employers)
## [1] "numeric"
class(integrated_data$life_expectancy)
## [1] "numeric"
class(integrated_data$infant_deaths)
## [1] "numeric"
class(integrated_data$population)
## [1] "numeric"
class(integrated_data$rate_employer_percentage)
## [1] "numeric"
sum(is.infinite(integrated_data$country))
## [1] 0
sum(is.infinite(integrated_data$year))
## [1] 0
sum(is.infinite(integrated_data$percent_employers))
## [1] 0
sum(is.infinite(integrated_data$status))
## [1] 0
sum(is.infinite(integrated_data$life_expectancy))
## [1] 0
sum(is.infinite(integrated_data$infant_deaths))
## [1] 0
sum(is.infinite(integrated_data$population))
## [1] 0
sum(is.infinite(integrated_data$rate_employer_percentage))
## [1] 0
head(integrated_data)

Scan 2

A point that lies more than 1.5 times the interquartile range below the first quartile or above the third quartile is called an outlier, and a box plot is used to detect such points. In statistical research one can choose to keep, remove or adjust outliers depending on the study being carried out. In this instance we have chosen to handle them with the capping method (also called Winsorising), in which outliers are replaced with less extreme values from the same distribution (here the 5th and 95th percentiles). Comparing the box plots with and without the outliers shows a stark difference.

#Due to page constraints, only one plot with outliers is shown. Code for plots of other columns is present but commented
boxplot(integrated_data$percent_employers,main = "Percent Employers",col = "green",horizontal = TRUE)

#boxplot(integrated_data$life_expectancy,main = "Life Expectancy",col = "blue",horizontal = TRUE)
#boxplot(integrated_data$infant_deaths,main = "Infant Deaths",col = "red",horizontal = TRUE)
#boxplot(integrated_data$population,main = "Population",col = "pink",horizontal = TRUE)
#boxplot(integrated_data$rate_employer_percentage,main = "Rate Employers",col = "yellow",horizontal = TRUE)
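#cap() winsorises a numeric vector: values below Q1 - 1.5*IQR are replaced with
#the 5th percentile and values above Q3 + 1.5*IQR with the 95th percentile.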
cap <- function(x){
 quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ))
 x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
 x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
 x
}
integrated_data$percent_employers %>% cap() %>% boxplot(main="Percent Employers (cap)",col="green",horizontal = TRUE)

integrated_data$life_expectancy %>% cap() %>% boxplot(main="Life Expectancy  (cap)",col="blue",horizontal = TRUE)

integrated_data$infant_deaths %>% cap() %>% boxplot(main="Infant Deaths  (cap)",col="red",horizontal = TRUE)

integrated_data$population %>% cap() %>% boxplot(main="Population  (cap)",col="pink",horizontal = TRUE)

integrated_data$rate_employer_percentage %>% cap() %>% boxplot(main="Rate of Employers (cap)",col="yellow",horizontal = TRUE)
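#Note: the capped vectors above are only piped into boxplot() for comparison and
#are not saved. If the capped values should replace the originals, assign the
#result back, for example (a sketch):
#integrated_data$percent_employers <- cap(integrated_data$percent_employers)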

Transform

Data transformation is used to reduce skewness and bring a variable closer to a normal distribution. The Box-Cox transformation transforms non-normal data towards normality; right skewness can also be reduced with log or root transforms, and left skewness with square or other power transforms. Scaling (normalisation) is generally applied before feeding data to machine learning algorithms. We have applied several of these transformations to the percent_employers column.

class(integrated_data$percent_employers)
## [1] "numeric"
integrated_data$percent_employers %>% hist(main="Employers")

bc <- BoxCox(integrated_data$percent_employers,lambda = "auto")
bc %>% hist(main="BOX-COX")

#for reducing right skewness
log_tf<-log(integrated_data$percent_employers)
hist(log_tf)

#for left skewness
square<- function(x) {x*x}
sq_tf<-square(integrated_data$percent_employers)
hist(sq_tf)

#scaling to unit standard deviation, first without and then with centering
norm_tf<-scale(integrated_data$percent_employers,center = FALSE,scale = sd(integrated_data$percent_employers))
norm_centre_tf<-scale(integrated_data$percent_employers,center = TRUE,scale = sd(integrated_data$percent_employers))
hist(norm_tf)

hist(norm_centre_tf)

References