MATH2349 Data Preprocessing - Assignment 3, Semester 1 2018

Required packages

library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(forecast)
rm(list=ls())

Executive Summary

This assignment investigated and pre-processed the two data sets of the WA (Western Australia) public sector workforce profile June in 2014 and 2016 respectively. First of all, baseR functions were used to understand the structure of the data. Since the original datasets have more than 100 variables each, only a few related variables were selected to conduct further pre-processing. In the first part of “Tidy & Manipulate Data” task, variable types were transformed into proper class. Also, as there are column names presented as values (female & male) instead of variables in the original data, tidyr gather function were used to transform data from wide to long format, then the new column “sex” was factorized. Next, the two datasets were merged into one dataframe, using “entity”, “sex” as keys.
For the second part of “Tidy & Manipulate Data” task, the goal is to compare the changes of employee numbers from 2014 to 2016 per entity. Therefore, mutate were performed to create a new column with the calculated result. In the first “Scan” task, missing values were detected, and impute technique was used based on investigation of missing values. Further scan the data to detect numeric outliers using boxplot; the cap technique was used to deal with the outliers. Finally, the data transformation technique was used to normalise the data, making ready for any future analysis that might require normality.

Data

This assignment adopt the data sets of the WA (Western Australia) public sector workforce profile June 2014 & 2016 from data.gov.au.
Link of Data Source: https://data.gov.au/dataset/wa-public-sector-workforce-profiles

Variable descriptions (selected columns):

Entity: the names of the public sector entity.
Female % : percentage of female employees.
Male % : percentage of male employees.
Total Permanent employees throughout the financial year : total number of permanent employees throughout the financial year.
Number of People with disability : number of people with disability in the entity.
People with disability Equity Index : index calculated based on the number of people with disability in the entity.

# Load WA public sector workforce profile for 2014 & 2016
wf2014 <- read_csv("WA public sector workforce 2014.csv")
wf2016 <- read_csv("WA public sector workforce 2016.csv")

Understand

Inspect the data structure of the WA public sector workforce profile for both 2014 & 2016
- check the dimensions of the data frame.
  - Result: there are 86 observations of 101 variables for year 2014, while
  - ```
    there are 88 observations of 103 variables for year 2016
```
- check the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set.
  - Result: Data types for the variables in year 2014 & 2016 datasets are integer, character and numeric.
- check the column names in the data frame
  - Result: Column names for 2014 dataset ‘Entity’,‘Female %’, ‘Male %’ and ‘Total Permanent employees throughout the financial year’ will be renamed to “entity”, “female”, “male”, “# of employee” for simplicity, and, the same will be done to dataset 2016 plus 2 additional columns ‘Number of People with disability’ and ‘People with disability Equity Index’ to be changed to “disabiity(# of people)” and “disability equity index” respectively.
    - Rename will be done at step Tidy & Manipulate Data I.
- check incomplete rows of the 2014 & 2016 datasets
  - Result: The following issues have been identified.
    - Although it shows there is no incomplete values, further investigation shows N/A values exists in column ‘People with disability Equity Index’.
    - The above rows will be dealt with properly at step Scan I.

head(wf2014,3)   # display first 3 rows

# Any incompleted rows?
wf2014[!complete.cases(wf2014),]

head(wf2016,3)   # display first 3 rows

# Any incompleted rows?
wf2016[!complete.cases(wf2016),]

Tidy & Manipulate Data I

change Column names for 2014 dataset ‘Entity’,‘Female %’, ‘Male %’ and ‘Total Permanent employees throughout the financial year’ will be renamed to “entity”, “female”, “male”, “# of employee” for clarity, and, the same will be done to dataset 2016 plus 2 additional columns ‘Number of People with disability’ and ‘People with disability Equity Index’ to be changed to “disabiity(# of people)” and “disability equity index” respectively.
Manipulate female and male columns
- 1. Remove % character from columns ‘female’ and ‘male’ so that they can be converted from character to numeric.
- 1. Gather female and male columns into key-value pairs - sex and employee %
- 1. Recalculate # of employee for each row according to employee % as original # of employee including both female and male.
Reorder and rename columns for both datasets, wf2014 & wf2016
Combine datasets by entity and sex with merge() to create wf1416 for further processing.

# extracting required data columns
wf2014 <- wf2014[c('Entity','Female %', 'Male %', 'Total Permanent employees throughout the financial  year')]

# rename columns
colnames(wf2014) <- c("entity", "female", "male", "# of employee")

# remove % character for conversion to numeric data type
for (i in 1:nrow(wf2014)) {
    wf2014[i,"female"] <- substring(wf2014[i,"female"],1,nchar(wf2014[i,"female"])-1)
}
for (i in 1:nrow(wf2014)) {
    wf2014[i,"male"] <- as.numeric(substring(wf2014[i,"male"],1,nchar(wf2014[i,"male"])-1))
}

# convert columns to numeric data type
wf2014[,c(2,3)] <- sapply(wf2014[,c(2,3)], as.numeric)
str(wf2014)

## Classes 'tbl_df', 'tbl' and 'data.frame':    86 obs. of  4 variables:
##  $ entity       : chr  "WA public sector" "Department of Agriculture and Food" "Art Gallery of Western Australia" "Animal Resources Authority" ...
##  $ female       : num  71.9 46 73 71.8 56.8 50.6 46.7 63.4 55.6 50 ...
##  $ male         : num  28.1 54 27 28.2 43.2 49.4 53.3 36.6 44.4 50 ...
##  $ # of employee: int  109048 1204 58 59 137 111 0 802 491 102 ...

wf2016 <- wf2016[c('Entity','Female %', 'Male %', 'Total Permanent employees throughout the financial  year', 'Number of People with disability' ,'People with disability Equity Index')]

# rename columns
colnames(wf2016) <- c("entity", "female", "male", "# of employee","disabiity(# of people)" ,"disability equity index")

# remove % character for conversion to numeric data type
for (i in 1:nrow(wf2016)) {
    wf2016[i,"female"] <- as.numeric(substring(wf2016[i,"female"],1,nchar(wf2016[i,"female"])-1))
}
for (i in 1:nrow(wf2016)) {
    wf2016[i,"male"] <- as.numeric(substring(wf2016[i,"male"],1,nchar(wf2016[i,"male"])-1))
}

# convert columns to numeric data type
wf2016[,c(2,3,6)] <- sapply(wf2016[,c(2,3,6)], as.numeric)

## Warning in lapply(X = X, FUN = FUN, ...): NAs introduced by coercion

str(wf2016)

## Classes 'tbl_df', 'tbl' and 'data.frame':    88 obs. of  6 variables:
##  $ entity                 : chr  "WA public sector" "Department of Agriculture and Food" "Art Gallery of Western Australia" "Animal Resources Authority" ...
##  $ female                 : num  72.3 45.5 71.2 73.2 52.9 54.4 63.4 56.7 50.4 66.7 ...
##  $ male                   : num  27.7 54.5 28.8 26.8 47.1 45.6 36.6 43.3 49.6 33.3 ...
##  $ # of employee          : int  110916 1105 57 57 108 10 730 440 96 147 ...
##  $ disabiity(# of people) : int  1873 18 2 2 1 0 29 26 4 0 ...
##  $ disability equity index: num  101.7 103.5 19.4 12 2.7 ...

# Group column headings, female & male, that are values under a variable column 'sex' and
#   their values in another variable column 'employee %'.
wf2014 <- wf2014 %>% gather("sex", "employee %", 2:3)

# recalculate # of employee according to employee %
for (i in 1:nrow(wf2014)) {
    wf2014[i,"# of employee"] <- as.integer(wf2014[i, "# of employee"] * wf2014[i,"employee %"] / 100)
}
str(wf2014)

## Classes 'tbl_df', 'tbl' and 'data.frame':    172 obs. of  4 variables:
##  $ entity       : chr  "WA public sector" "Department of Agriculture and Food" "Art Gallery of Western Australia" "Animal Resources Authority" ...
##  $ # of employee: int  78405 553 42 42 77 56 0 508 272 51 ...
##  $ sex          : chr  "female" "female" "female" "female" ...
##  $ employee %   : num  71.9 46 73 71.8 56.8 50.6 46.7 63.4 55.6 50 ...

# recalculate # of employee according to employee %
wf2016 <- wf2016 %>% gather("sex", "employee %", 2:3)
for (i in 1:nrow(wf2016)) {
    wf2016[i,"# of employee"] <- as.integer(wf2016[i, "# of employee"] * wf2016[i,"employee %"] / 100)
}
str(wf2016)

## Classes 'tbl_df', 'tbl' and 'data.frame':    176 obs. of  6 variables:
##  $ entity                 : chr  "WA public sector" "Department of Agriculture and Food" "Art Gallery of Western Australia" "Animal Resources Authority" ...
##  $ # of employee          : int  80192 502 40 41 57 5 462 249 48 98 ...
##  $ disabiity(# of people) : int  1873 18 2 2 1 0 29 26 4 0 ...
##  $ disability equity index: num  101.7 103.5 19.4 12 2.7 ...
##  $ sex                    : chr  "female" "female" "female" "female" ...
##  $ employee %             : num  72.3 45.5 71.2 73.2 52.9 54.4 63.4 56.7 50.4 66.7 ...

# reorder columns
wf2014 <- wf2014[,c(1,3,4,2)]
wf2016 <- wf2016[,c(1,5,6,2,3,4)]

# rename columns to include year
colnames(wf2014) <- c("entity", "sex", "employee % 14", "# of employee 14")
colnames(wf2016) <- c("entity", "sex", "employee % 16", "# of employee 16","disability(# of people)_16", "disability equity index_16")

#merge the 2014 and 2016 data into 1 dataframe "wf1416"
wf1416 <- merge(wf2014,wf2016, by = c("entity", "sex"))

head(wf1416, 3)

To transform the “sex” column from character to “factor”, we factorize “sex” column in “wf1416” data set.

#factorize "sex"
wf1416$sex <-  factor(wf1416$sex)

str(wf1416)

## 'data.frame':    126 obs. of  8 variables:
##  $ entity                    : chr  "Animal Resources Authority" "Animal Resources Authority" "Art Gallery of Western Australia" "Art Gallery of Western Australia" ...
##  $ sex                       : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ employee % 14             : num  71.8 28.2 73 27 50.6 49.4 50 50 75.9 24.1 ...
##  $ # of employee 14          : int  42 16 42 15 56 54 51 51 25 7 ...
##  $ employee % 16             : num  73.2 26.8 71.2 28.8 52.9 47.1 50.4 49.6 77.8 22.2 ...
##  $ # of employee 16          : int  41 15 40 16 57 50 48 47 21 5 ...
##  $ disability(# of people)_16: int  2 2 2 2 1 1 4 4 3 3 ...
##  $ disability equity index_16: num  12 12 19.4 19.4 2.7 ...

Tidy & Manipulate Data II

Mutate a new column to calculate the change regarding to the number of employees from 2014 to 2016.

# mutate a new column "change_of_employees(%)", and rename the df as "wf1416_m"
#wf1416_m <-  wf1416 %>% mutate("change_of_employees(%)" = ((wf1416$`employee % 16` - wf1416$`employee % 14`)/ wf1416$`employee % 14`)*100 )

wf1416_m <-  wf1416 %>% mutate("change_of_#_of_employees" = wf1416$`# of employee 16` - wf1416$`# of employee 14`)

#reorder the colmumns 
 
wf1416_m<- wf1416_m[,c(1,2,4,6,3,5,9,7,8)]

head(wf1416_m)

Scan I

Scan the data set for missing values.

#find the number of missing values
sum(is.na(wf1416_m))

## [1] 28

Find the location of missing value, using which and is.na function.

which(is.na(wf1416_m$`disability equity index`))

##  [1]  11  12  13  14  77  78  85  86  89  90  91  92  93  94 101 102 105
## [18] 106 111 112 115 116 117 118 121 122 123 124

Calculate of percentage of missing values of total obervations

sum(is.na(wf1416_m)) / nrow(wf1416_m) # number of missing values / number of observations

## [1] 0.2222222

Since the missing value takes up almost 20% of total observation. It’s inappropriate to simply delete the observation with missing values. Further inspect to the data, we find that the NA values in “disability equity index” was associated with the zero number of people with disability in the entity–column name: “disability(# of people)_16“, as illustrated below. Therefore, it’s more suitable to give zero values to the missing values in”disability equity index." In another words, if there are no people with disability in the entity,the corresponding index should also be zero.

# subset the NA observations, and name the illustrated df as "wf1416_m_NA"
wf1416_m_NA <- (wf1416_m[,c(1,8,9)])[!complete.cases(wf1416_m[,c(1,8,9)]),]

wf1416_m_NA

wf1416_m$`disability equity index_16` <- impute(wf1416_m$`disability equity index_16`, 0) 

## check which values are imputed
is.imputed(wf1416_m$`disability equity index_16`)

##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [12]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
##  [78]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
##  [89]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [111]  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
## [122]  TRUE  TRUE  TRUE FALSE FALSE

Scan II

There’s total 7 numeric variables in the wf1416_m dataframe. We use boxplot to scan the variables for “outliers.” Based on the results of boxplots, there are 5 varibles with outliers : “# of employee 14”,“# of employee 16” ,“change_of_#_of_employees“,”disability(# of people)_16“,”disability equity index_16"

In “# of employee 14” boxplot, there’re obvios outliers.

boxplot(wf1416_m$`# of employee 14`)

In “# of employee 16” boxplot, there’re some outliers.

boxplot(wf1416_m$`# of employee 16`)

In “employee % 14” boxplot, there’s no outlier indicated.

boxplot(wf1416_m$`employee % 14`)

In “employee % 16” boxplot, there’s no outlier indicated.

boxplot(wf1416_m$`employee % 16`)

In “change_of_#_of_employees" boxplot, there’re some outliers .

boxplot(wf1416_m$`change_of_#_of_employees`)

In “disability(# of people)_16" boxplot, there’re some outliers.

boxplot(wf1416_m$`disability(# of people)_16`)

In “disability equity index_16” boxplot, there’s one outliers.

#transform the class of  `disability equity index_16`from "impute" to "numeric"
wf1416_m$`disability equity index_16` <- as.numeric(wf1416_m$`disability equity index_16`)
#boxplot
boxplot(wf1416_m$`disability equity index_16`)

Deal with outliers Define a function name “cap” to cap the values outside the limits

# Define a function to cap the values outside the limits

cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}


NA_removed_wf1416 <- subset(wf1416_m, wf1416_m$`disability(# of people)_16` != 0)

# Take a subset of NA_removed_wf1416 data using quantitative variables

NA_removed_wf1416_sub <- NA_removed_wf1416 %>%  dplyr::select('# of employee 14',
                                                              '# of employee 16',
                                                              'employee % 14',
                                                              'employee % 16',
                                                              'change_of_#_of_employees',
                                                              'disability(# of people)_16',
                                                              'disability equity index_16')

# See descriptive statistics
summary(NA_removed_wf1416_sub)

##  # of employee 14  # of employee 16   employee % 14   employee % 16  
##  Min.   :    7.0   Min.   :    5.00   Min.   :16.90   Min.   :17.50  
##  1st Qu.:   54.0   1st Qu.:   52.25   1st Qu.:38.75   1st Qu.:38.95  
##  Median :   99.5   Median :   96.00   Median :50.00   Median :50.00  
##  Mean   : 1395.9   Mean   : 1411.60   Mean   :50.00   Mean   :50.00  
##  3rd Qu.:  394.8   3rd Qu.:  367.25   3rd Qu.:61.25   3rd Qu.:61.05  
##  Max.   :78405.0   Max.   :80192.00   Max.   :83.10   Max.   :82.50  
##  change_of_#_of_employees disability(# of people)_16
##  Min.   : -89.00          Min.   :   1.00           
##  1st Qu.:  -7.00          1st Qu.:   3.00           
##  Median :  -2.00          Median :   5.00           
##  Mean   :  15.74          Mean   :  53.45           
##  3rd Qu.:   2.00          3rd Qu.:  15.00           
##  Max.   :1787.00          Max.   :1873.00           
##  disability equity index_16
##  Min.   :  2.7             
##  1st Qu.: 60.1             
##  Median : 77.9             
##  Mean   :105.7             
##  3rd Qu.:128.3             
##  Max.   :451.3

# Apply a user defined function "cap" to a data frame

NA_removed_wf1416_capped <- cbind(NA_removed_wf1416[,1:2] ,sapply(NA_removed_wf1416_sub, FUN = cap))

# Check summary statistics again

summary(NA_removed_wf1416_capped)

##     entity              sex     # of employee 14 # of employee 16 
##  Length:98          female:49   Min.   :   7.0   Min.   :   5.00  
##  Class :character   male  :49   1st Qu.:  54.0   1st Qu.:  52.25  
##  Mode  :character               Median :  99.5   Median :  96.00  
##                                 Mean   : 298.1   Mean   : 295.16  
##                                 3rd Qu.: 394.8   3rd Qu.: 367.25  
##                                 Max.   :1379.5   Max.   :1396.80  
##  employee % 14   employee % 16   change_of_#_of_employees
##  Min.   :16.90   Min.   :17.50   Min.   :-43.750         
##  1st Qu.:38.75   1st Qu.:38.95   1st Qu.: -7.000         
##  Median :50.00   Median :50.00   Median : -2.000         
##  Mean   :50.00   Mean   :50.00   Mean   : -2.735         
##  3rd Qu.:61.25   3rd Qu.:61.05   3rd Qu.:  2.000         
##  Max.   :83.10   Max.   :82.50   Max.   : 36.950         
##  disability(# of people)_16 disability equity index_16
##  Min.   : 1.00              Min.   :  2.7             
##  1st Qu.: 3.00              1st Qu.: 60.1             
##  Median : 5.00              Median : 77.9             
##  Mean   :11.45              Mean   :101.3             
##  3rd Qu.:15.00              3rd Qu.:128.3             
##  Max.   :41.00              Max.   :242.8

Transform

In data preprocessing, normalisation is conducted for any further statistical analysis / hypothesis. The histogram of “disability equity index_16” column appears serious right-skewed. To decrease the skewness and convert the distribution into a normal distribution,Log transformation were used in this section to normalize the data.

# plot histogram
hist(NA_removed_wf1416_capped$`# of employee 14`, breaks = 10)

Log transformation has normalised “# of employee 14” as shown in the histogram. Any statistical inferences basing on normalisation of data is ready to be carried out.

#Log transformation
ln_DEI16 <- log(NA_removed_wf1416_capped$`# of employee 14`)
# histogram after data transformation
hist(ln_DEI16)