library(readr)
library(tidyr)
library(dplyr)
library(outliers)
library(forecast)
This study uses labor force data from World Bank Open Data and merges total workforce figures for the countries of the world with female workforce data expressed as a percentage of the total workforce. The two data sets were merged with the left_join function. The male labor force was then derived with mutate from the total and female labor force, and the data were tidied further with the gather function, which produced the Gender variable. The country name and Gender variables were converted to character and factor variables respectively, and Gender was labelled and ordered. The str and attributes functions were used to inspect the new data frame, "Labor".
The data set was scanned for missing values. Because the share of missing values is below 5% (1754 values, about 3% of all cells), the missing values were omitted. The data set was then scanned for numeric outliers, which were handled by capping. Finally, the Labor data were transformed with BoxCox, because the distribution is strongly right-skewed (a long tail of very large values) and a transformation is required. readr, tidyr, dplyr, outliers and forecast are the packages used most in this assignment.
Labor force data for the countries of the world are used for the assignment. The data are extracted from World Bank Open Data, available at https://data.worldbank.org/indicator. Two data sets are used initially: the total labor force, and the female labor force as a percentage of the total labor force. The read.csv function is used to import the data sets.
The Labor data set lists the countries of the world and their workforce for 1990-2017, and the Female data set gives the female share of the workforce, in percent, for each country over 1990-2017. Both data sets contain factor and numeric variables. After reading the data, only the columns required for the analysis (country name and the yearly values for 1990-2017) were kept; country code, indicator name and indicator code were dropped.
The gather() function was used to tidy both data sets, which were then merged with left_join().
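# Total labor force: skip the 4 metadata lines at the top of the World Bank CSV
Labor <- read.csv("C:/Users/User/Downloads/Labor force.csv", skip = 4)
# Keep the country name (column 1) and the 1990-2017 columns (35 to 62)
Labor <- Labor %>% select(c(1, 35:62))
# Reshape from one column per year to one row per country and year
Labor <- Labor %>% gather(c(`X1990`:`X2017`), key = "Years", value = "Labor force")
# Female share of the labor force (%), prepared in the same way
Female <- read.csv("C:/Users/User/Downloads/Female.csv", skip = 4)
Female <- Female %>% select(c(1, 35:62))
Female <- Female %>% gather(c(`X1990`:`X2017`), key = "Years", value = "Female Labor force")
# Attach the female share to each country-year row of the total labor force
Labor <- Labor %>% left_join(Female, by = c("Country.Name", "Years"))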
Labor
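The tidy-then-merge step can be illustrated on a small scale. The sketch below is not the report's data: `total_wide` and `female_wide` are hypothetical two-country tables that stand in for the two World Bank downloads, and only two year columns are used.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide tables (assumed values, for illustration only)
total_wide <- data.frame(
  Country.Name = c("A", "B"),
  X1990 = c(1000, 2000),
  X1991 = c(1100, 2100)
)
female_wide <- data.frame(
  Country.Name = c("A", "B"),
  X1990 = c(40, 50),
  X1991 = c(41, NA)
)

# Reshape each table to one row per country and year
total_long <- total_wide %>%
  gather(X1990:X1991, key = "Years", value = "Labor force")
female_long <- female_wide %>%
  gather(X1990:X1991, key = "Years", value = "Female Labor force")

# Keep every row of the total table and attach the matching female share
merged <- total_long %>%
  left_join(female_long, by = c("Country.Name", "Years"))

merged
```

Because left_join keeps all rows of the left table, a country-year with no female share still appears in the result, with NA in the "Female Labor force" column.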
First, the mutate function was used to convert the female share of the workforce into a number of workers. The number of males in the workforce was then obtained by subtracting the number of female workers from the total workforce. Only the required variables (country name, years, female labor and male labor) were selected from the Labor data set. Finally, the gather() function was used to collect the female and male labor variables into a single Gender variable.
The country name was converted to a character variable, and Gender was stored as a factor and then ordered.
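# Female share (%) times total labor force gives the number of female workers
Labor <- Labor %>% mutate(`Female labor` = round(`Female Labor force`/100*`Labor force`, 0))
# Male workers = total labor force minus female workers
Labor <- Labor %>% mutate(`Male labor` = `Labor force` - `Female labor`)
Labor <- Labor %>% select(c(`Country.Name`, `Years`, `Female labor`, `Male labor`))
# Stack the two count columns into a Gender key and a Labor value
Labor <- Labor %>% gather(c(`Female labor`, `Male labor`), key = "Gender", value = "Labor")
Labor$Country.Name <- as.character(Labor$Country.Name)
Labor$Gender <- as.factor(Labor$Gender)
# Relabel the levels and make the factor ordered (Female < Male)
Labor$Gender <- factor(Labor$Gender, levels = c("Female labor", "Male labor"), labels = c("Female", "Male"), ordered = TRUE)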
Labor
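As a small worked example of the same derivation, the sketch below uses a hypothetical two-row table `toy` with assumed column names (`total`, `female_pct`) rather than the report's variables. It converts a percentage to a count, derives the male count by subtraction, and builds the ordered Gender factor.

```r
library(dplyr)
library(tidyr)

# Hypothetical input (assumed values, for illustration only)
toy <- data.frame(
  Country.Name = c("A", "B"),
  Years = c("X1990", "X1990"),
  total = c(1000, 2000),     # total labor force (persons)
  female_pct = c(40, 25)     # female share of the labor force (%)
)

toy_long <- toy %>%
  mutate(female = round(female_pct / 100 * total, 0),  # share -> count
         male = total - female) %>%                    # remainder is male
  select(Country.Name, Years, female, male) %>%
  gather(female, male, key = "Gender", value = "Labor") %>%
  mutate(Gender = factor(Gender,
                         levels = c("female", "male"),
                         labels = c("Female", "Male"),
                         ordered = TRUE))

toy_long
```

Country A has 400 female and 600 male workers, and country B has 500 and 1500, so the female and male counts for each country add back up to the total.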
The new data set, created by merging the initial data sets, is considered here. It consists of the following variables: Country.Name (character), Years (character), Gender (factor) and Labor (numeric).
Gender is an ordered, labelled factor with the levels Female and Male.
str(Labor)
## 'data.frame': 14784 obs. of 4 variables:
## $ Country.Name: chr "Aruba" "Afghanistan" "Angola" "Albania" ...
## $ Years : chr "X1990" "X1990" "X1990" "X1990" ...
## $ Gender : Ord.factor w/ 2 levels "Female"<"Male": 1 1 1 1 1 1 1 1 1 1 ...
## $ Labor : num NA 461826 2442652 580237 NA ...
attributes(Labor$Gender)
## $levels
## [1] "Female" "Male"
##
## $class
## [1] "ordered" "factor"
A total of 1754 missing values were found in the Labor data set, about 3% of its 59,136 cells (14,784 rows by 4 variables). Since this is below 5%, the missing values were excluded. The incomplete cases were first subset for inspection and then dropped with the na.omit function, giving the Labor2 data frame, which consists of complete cases only.
sum(is.na(Labor))
## [1] 1754
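# Rows with at least one missing value, kept separately for inspection
Labor_incomplete <- Labor[!complete.cases(Labor),]
# Complete cases only
Labor2 <- na.omit(Labor)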
Labor2
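The share of missing data can be measured per cell or per row, and the two figures can differ noticeably, so it helps to compute both before applying a 5% rule. A minimal base R sketch on a hypothetical four-row data frame `toy`:

```r
# Hypothetical data with missing values in one column only
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  labor = c(10, NA, 30, NA)
)

n_missing <- sum(is.na(toy))                # number of missing cells
share_cells <- mean(is.na(toy))             # missing cells / all cells
share_rows <- mean(!complete.cases(toy))    # rows with any NA / all rows

incomplete <- toy[!complete.cases(toy), ]   # rows that would be dropped
toy_complete <- na.omit(toy)                # rows that are kept

c(n_missing = n_missing, share_cells = share_cells, share_rows = share_rows)
```

Here 2 of 8 cells are missing (25%) but 2 of 4 rows are affected (50%), because all the missing values sit in a single column.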
To scan the data for numeric outliers, the z-score method is used. The z-score is a distance-based method for detecting univariate outliers; an absolute standardised score above 3 is treated as an outlier here. The scores function in the outliers package is used to compute the z-scores. Outliers are handled by capping: values beyond the 1.5 × IQR fences are replaced with the 5th or 95th percentile, that is, with a nearby value that is not an outlier.
Labor_outliers <- Labor2$Labor %>% scores(type = "z")
Labor_outliers %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.3193 -0.3163 -0.3083 0.0000 -0.2469 10.0078
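# Cap values outside the 1.5 * IQR fences at the 5th / 95th percentile
cap <- function(x){
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  x[x < quantiles[2] - 1.5*IQR(x)] <- quantiles[1]  # low outliers -> 5th percentile
  x[x > quantiles[3] + 1.5*IQR(x)] <- quantiles[4]  # high outliers -> 95th percentile
  x
}
# Capped values are stored separately; Labor2$Labor itself is unchanged
Labor_capped <- Labor2$Labor %>% cap()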
Labor2
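To make the two steps concrete, the base R sketch below flags outliers by z-score and then caps them with the same fence rule as the cap function. The vector `x` is made up for illustration, and the z-scores are computed by hand as (x - mean) / sd rather than with the outliers package.

```r
# Hypothetical data: 20 ordinary values and one extreme value
x <- c(rep(c(10, 11, 12, 13), 5), 500)

# Standardised scores; |z| > 3 is the threshold used in the report
z <- (x - mean(x)) / sd(x)
flagged <- which(abs(z) > 3)

# Same capping rule as in the report: values outside the 1.5 * IQR
# fences are replaced with the 5th or 95th percentile
cap <- function(x) {
  q <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  x[x < q[2] - 1.5 * IQR(x)] <- q[1]
  x[x > q[3] + 1.5 * IQR(x)] <- q[4]
  x
}
x_capped <- cap(x)

flagged
range(x_capped)
```

Only the last element is flagged, and after capping it takes the value of the 95th percentile (13), so the length of the vector is preserved while the extreme value no longer dominates the mean and standard deviation.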
The histogram shows that the data set is not symmetric. Many statistical methods assume approximately normally distributed data. A transformation can help to turn non-linear relationships into linear ones and to bring a skewed distribution closer to normal. The Box-Cox transformation from the forecast package is used here to bring the right-skewed distribution closer to a normal distribution.
hist(Labor2$Labor)
Boxcox_Labor <- BoxCox(Labor2$Labor, lambda = "auto")
hist(Boxcox_Labor)
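For reference, the Box-Cox transformation with parameter lambda is (x^lambda - 1) / lambda for lambda not equal to 0, and log(x) for lambda = 0, defined for positive x. The sketch below implements that formula by hand in base R on a made-up vector. It is an illustration only: `box_cox` and `skew` are hypothetical helper names, and the report itself relies on forecast's BoxCox with an automatically chosen lambda.

```r
# Hand-written Box-Cox transform (illustrative helper, not the forecast version)
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))               # the formula needs positive values
  if (abs(lambda) < 1e-8) {
    log(x)                            # limiting case lambda = 0
  } else {
    (x^lambda - 1) / lambda
  }
}

# Simple moment-based skewness, used only to compare before and after
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Hypothetical right-skewed data
x <- c(1, 2, 4, 8, 16, 1000)

x_log <- box_cox(x, lambda = 0)       # same as log(x)
x_half <- box_cox(x, lambda = 0.5)    # square-root type transform

c(raw = skew(x), lambda_0.5 = skew(x_half), lambda_0 = skew(x_log))
```

Smaller values of lambda compress the right tail more strongly, which is why a right-skewed variable such as a head count usually ends up with a lambda well below 1.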