MATH2349 Semester 1, 2019

Required packages

The following packages were utilized in the the proceeding project.

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

Executive Summary

Data preprocessing is a vital step in any sort of data analysis, due to the quality of the analysis being highly dependent on the quality of the data it utilizes. For this purpose, the Human Resources Dataset will be preprocessed in the following project. In order to achieve a dataset that is suitable for analysis, a number of preprocessing tasks will be undertaken. First, the data will be inspected so that it’s dimensions, attributes and structure may be understood. Consequently, any necessary data type conversions will be performed. This step will include the conversion of character strings in to ordered factors where necessary, and the conversion of character strings that contain date related information into date type data. Following these preliminary steps, the data will be checked to determine whether or not it adhere’s to Hadley Whickam’s Tidy Data Principles, it will then be modified as necessary.

Further, a mutation will be performed on the Days Employed variable, to create a Months Employed variable. The dataset will then be further checked for missing values, inconsistencies and errors and if found, these issues will be dealt with accordingly. Following this, a similar course will be taken with regard to outliers that may be present in the numerical variables in the data frame. Finally, a transformation will be performed in order to coerce the Pay Rate variable into a greater degree of normality for statistical analysis. At the conclusion of the aforementioned processes, the data will be ready for analysis.

Data

The following report is based on a merging of two datasets within the Human Resources Dataset. The datasets were retrieved from Kaggle and can be found at https://www.kaggle.com/rhuebner/human-resources-data-set/downloads/human-resources-data-set.zip/4#HRDataset_v9.csv. The “HRDataset_v9” and “production_staff” tables were created through a semi join on the Employee Name variable.

Further to the merging of the two datasets, a subset has been created to exclude the follow variables: MarriedID, MaritalStatusID, GenderID, EmpStatus_ID, DeptID and Perf_ScoreID. This decision was made due to the fact that all of the variables in question contained ID numbers that did not contain information that could not be determined elsewhere in the joined dataset.

Therefore the original join contained 209 observations of 28 variables. The resulting subset then contained 209 observations of 22 variables detailing various details of the employees.

The subset of data used in this project includes both personal details of employees along with details regarding their employment. The full set of variables for the subset before pre-processing are listed as follows:

Employee Name [Character]: Surname and given name of employee.
Employee Number [Double]: Unique employee number.
Age [Double]: Employee age in years.
Pay Rate [Double]: Employee pay rate.
State [Character]: Employee’s state of residence.
Zip [Character]: Zip code of employee’s residence.
DOB [Character]: Employee’s date of birth.
Sex [Character]: Employee’s gender.
MaritalDesc [Character]: Employee’s marital status.
CitizenDesc [Character]: Employee’s citizenship status.
Hispanic/Latino [Character]: Information as to whether an employee identifies as Hispanic or Latino.
RaceDesc [Character]: Information regarding what race the employee identifies with.
Date of Hire [Character]: Date of employee’s hire.
Days Employed [Double]: Number of days in total that have passed since the employee’s date of hire.
Date of Termination [Character]: Date that employee’s employment was terminated.
Reason for Term [Character]: Reason for employee’s termination.
Employment Status [Character]: Employee’s status of employment.
Department [Character]: Information regarding what department an employee works in.
Position [Character]: Information regarding what position the employee occupies within the organisation.
Manager Name [Character]: Name of employee’s manager.
Employee Source [Character]: Information as to how the employee learned of the job opportunity leading to their employment.
Performance Score [Character]: Employee’s performance review score category.

In this section of the report the “HRDataset_v9” and “production_staff” datasets have been read into R. Following their reading, the first five observations of both data sets have been shown. Then, the two datasets have been merged using a semi-join, so as to only preserve the observations that are a match between the two original datasets. The new, joined dataset was then given the alias “Complete”.

setwd("C:/Users/Jaimee-Lee/Documents/R/Working Directory/Data Preprocessing/Data Sets/Assignment Three")
HRDataset<- read_csv("HRDataset_v9.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Employee Number` = col_double(),
##   MarriedID = col_double(),
##   MaritalStatusID = col_double(),
##   GenderID = col_double(),
##   EmpStatus_ID = col_double(),
##   DeptID = col_double(),
##   Perf_ScoreID = col_double(),
##   Age = col_double(),
##   `Pay Rate` = col_double(),
##   Zip = col_double(),
##   `Days Employed` = col_double()
## )

## See spec(...) for full column specifications.

productionstaff<- read_csv("production_staff.csv")

## Parsed with column specification:
## cols(
##   `Employee Name` = col_character(),
##   `Race Desc` = col_character(),
##   `Date of Hire` = col_character(),
##   TermDate = col_character(),
##   `Reason for Term` = col_character(),
##   `Employment Status` = col_character(),
##   Department = col_character(),
##   Position = col_character(),
##   Pay = col_character(),
##   `Manager Name` = col_character(),
##   `Performance Score` = col_character(),
##   `Abutments/Hour Wk 1` = col_double(),
##   `Abutments/Hour Wk 2` = col_double(),
##   `Daily Error Rate` = col_double(),
##   `90-day Complaints` = col_double()
## )

head(HRDataset)

head(productionstaff)

Complete<- semi_join(HRDataset,productionstaff,by="Employee Name")

head(Complete)

As a last step, the dataset was subsetted using select() to exclude the identification variables (MarriedID, MaritalStatusID, GenderID, EmpStatus_ID, DeptID and Perf_ScoreID) as outlined above. The resulting subset was then also named “Complete”, so as to overwrite the join originally including the identification variables.

Complete <- Complete %>% select(., -c(MarriedID,MaritalStatusID,GenderID,EmpStatus_ID,DeptID,Perf_ScoreID))
head(Complete)

Understand

In the section below the subset Complete has been inspected using dim() and str(). The code shows that the dimensions of Complete are 209 observations by 22 variables. An outline of the structure of the data frame and the attributes is given using str(), which also includes information about the different data types of the variables.

dim(Complete)

## [1] 209  22

str(Complete)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 209 obs. of  22 variables:
##  $ Employee Name      : chr  "King, Janet" "Albert, Michael" "Bozzi, Charles" "Butler, Webster  L" ...
##  $ Employee Number    : num  1.00e+09 1.50e+09 1.30e+09 1.11e+09 1.41e+09 ...
##  $ Age                : num  63 49 48 34 44 36 45 41 36 36 ...
##  $ Pay Rate           : num  80 54.5 50.5 55 51 54 48.5 42 55 53 ...
##  $ State              : chr  "MA" "MA" "MA" "MA" ...
##  $ Zip                : num  1902 2169 1901 2169 1731 ...
##  $ DOB                : chr  "21/09/1954" "10/10/1968" "3/10/1970" "8/09/1983" ...
##  $ Sex                : chr  "Female" "Male" "Male" "Male" ...
##  $ MaritalDesc        : chr  "Married" "Divorced" "Single" "Single" ...
##  $ CitizenDesc        : chr  "US Citizen" "US Citizen" "US Citizen" "US Citizen" ...
##  $ Hispanic/Latino    : chr  "Yes" "No" "No" "No" ...
##  $ RaceDesc           : chr  "White" "White" "Asian" "White" ...
##  $ Date of Hire       : chr  "7/02/2012" "8/01/2011" "30/09/2013" "28/01/2016" ...
##  $ Days Employed      : num  1973 2309 311 668 1165 ...
##  $ Date of Termination: chr  NA NA "8/07/2014" NA ...
##  $ Reason For Term    : chr  "N/A - still employed" "N/A - still employed" "retiring" "N/A - still employed" ...
##  $ Employment Status  : chr  "Active" "Active" "Voluntarily Terminated" "Active" ...
##  $ Department         : chr  "Executive Office" "Production" "Production" "Production" ...
##  $ Position           : chr  "President & CEO" "Production Manager" "Production Manager" "Production Manager" ...
##  $ Manager Name       : chr  "Board of Directors" "Janet King" "Janet King" "Janet King" ...
##  $ Employee Source    : chr  "Pay Per Click - Google" "Employee Referral" "Billboard" "Pay Per Click - Google" ...
##  $ Performance Score  : chr  "Fully Meets" "Fully Meets" "Fully Meets" "Exceeds" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Employee Name` = col_character(),
##   ..   `Employee Number` = col_double(),
##   ..   MarriedID = col_double(),
##   ..   MaritalStatusID = col_double(),
##   ..   GenderID = col_double(),
##   ..   EmpStatus_ID = col_double(),
##   ..   DeptID = col_double(),
##   ..   Perf_ScoreID = col_double(),
##   ..   Age = col_double(),
##   ..   `Pay Rate` = col_double(),
##   ..   State = col_character(),
##   ..   Zip = col_double(),
##   ..   DOB = col_character(),
##   ..   Sex = col_character(),
##   ..   MaritalDesc = col_character(),
##   ..   CitizenDesc = col_character(),
##   ..   `Hispanic/Latino` = col_character(),
##   ..   RaceDesc = col_character(),
##   ..   `Date of Hire` = col_character(),
##   ..   `Days Employed` = col_double(),
##   ..   `Date of Termination` = col_character(),
##   ..   `Reason For Term` = col_character(),
##   ..   `Employment Status` = col_character(),
##   ..   Department = col_character(),
##   ..   Position = col_character(),
##   ..   `Manager Name` = col_character(),
##   ..   `Employee Source` = col_character(),
##   ..   `Performance Score` = col_character()
##   .. )

This inspection reveals that the variables Sex and Performance Score have been read into R as characters. It would be more ideal in terms of analysis, if these variables were factors. Therefore, the next piece of code illustrates the creation of factor levels for Sex (Male, Female) and Performance Score (N/A - too early to review, PIP, Needs Improvement, 90 - day meets, Fully Meets, Exceeds, Exceptional).

Complete$Sex<- factor(Complete$Sex, levels=c("Male","Female"),labels=c("Male","Female"))


Complete$`Performance Score`<- factor(Complete$`Performance Score`, levels=c("N/A- too early to review","PIP","Needs Improvement", "90-day meets", "Fully Meets", "Exceeds", "Exceptional" ),labels=c("N/A- too early to review","PIP","Needs Improvement", "90-day meets", "Fully Meets", "Exceeds", "Exceptional"), ordered = TRUE)

levels(Complete$Sex)

## [1] "Male"   "Female"

levels(Complete$`Performance Score`)

## [1] "N/A- too early to review" "PIP"                     
## [3] "Needs Improvement"        "90-day meets"            
## [5] "Fully Meets"              "Exceeds"                 
## [7] "Exceptional"

The inspection also revealed that the Date of Termination and Date of Hire variables were also read in as character strings. Therefore, these variables were converted into the date format to create ease of analysis.

Complete$`Date of Termination`<-dmy(Complete$`Date of Termination`)

head(Complete$`Date of Termination`)

## [1] NA           NA           "2014-07-08" NA           NA          
## [6] NA

Complete$`Date of Hire`<-dmy(Complete$`Date of Hire`)
head(Complete$`Date of Hire`)

## [1] "2012-02-07" "2011-01-08" "2013-09-30" "2016-01-28" "2014-09-18"
## [6] "2015-02-06"

Tidy & Manipulate Data I

The Complete data frame adheres to Hadley Wickham’s Tidy Data Principles. Each variable has its own column. The only column that could be considered to contain more than one variable would be that of Employee Name. Presently, it contains both the surname and given name of all employees, separated by a comma. Considering the context of the dataset, the decision was made to not separate the given names and surnames because it could be considered to be a singular string observation. The only reason a client may want these separated would be if they wanted to run some sort of analysis regarding family groups.

As can be seen in the data frame, each observation has its own row. Further to this, each value also has its own cell.

Considering that the data frame was in a relatively tidy format to begin with, no manipulation was undertaken.

Tidy & Manipulate Data II

The code below leads to the creation of a new variable i.e., MonthsEmployed, which divides the Days Employed values by twelve and therefore contains the number of months that each employee has been employed for.

mutate(Complete, MonthsEmployed = Complete$`Days Employed`/12)

Scan I

The following code scans all the variables for missing values and returns their count serially.

colSums(is.na(Complete))

##       Employee Name     Employee Number                 Age 
##                   0                   0                   0 
##            Pay Rate               State                 Zip 
##                   0                   0                   0 
##                 DOB                 Sex         MaritalDesc 
##                   0                   0                   0 
##         CitizenDesc     Hispanic/Latino            RaceDesc 
##                   0                   0                   0 
##        Date of Hire       Days Employed Date of Termination 
##                   0                   0                 126 
##     Reason For Term   Employment Status          Department 
##                   0                   0                   0 
##            Position        Manager Name     Employee Source 
##                   0                   0                   0 
##   Performance Score 
##                   0

As can be seen above, the only variable that contains missing values is the Date of Termination. No action should be taken here because the data just hasn’t been entered by whoever has compiled the dataset and it doesn’t affect our analysis.

Scan II

In this section, we consider only three numeric variables (Pay Rate, Age and Days Employed) and scan them for outliers. We disregard Employee Number in this scan as it is a variable to be used simply as a unique identifier and therefore the existence of its outliers is meaningless.

To visualize the outliers for the respective variables, we use boxplots.

For Pay Rate:

Outliers<-boxplot(Complete$`Pay Rate`)

Outliers$out

##  [1] 80.0 54.5 50.5 55.0 51.0 54.0 48.5 42.0 55.0 53.0 38.5 52.0 53.0 55.0

The output shows us that there are 14 outliers for this variable. However, we now proceed to filter out the value ‘80’ as that value represents the salary of the CIO, who will not be taken into consideration for most analyses, as he/she will inevitably skew the values.

The new filtered boxplot is obtained as follows:

NonCIOData<-filter(Complete,Complete$`Pay Rate`<80)
boxplot(NonCIOData$`Pay Rate`)

We now proceed to deal with the outliers by capping them. To do so, we use the following function:

cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    return(x)
}

CappedPay<-cap(NonCIOData$`Pay Rate`)
CappedPay

##   [1] 46.225 46.225 46.225 46.225 46.225 46.225 46.225 46.225 46.225 46.225
##  [11] 46.225 46.225 46.225 33.500 20.000 21.000 16.000 23.000 22.000 22.000
##  [21] 17.000 22.000 16.760 18.000 22.000 17.000 16.000 22.000 16.000 20.000
##  [31] 20.000 20.000 16.000 19.500 22.000 21.000 22.000 24.750 15.000 15.000
##  [41] 15.750 19.750 15.000 23.500 20.000 21.000 22.000 16.750 17.000 21.250
##  [51] 21.000 17.000 18.000 20.000 18.000 23.000 24.000 19.000 18.000 24.000
##  [61] 19.000 22.000 15.000 16.000 17.000 16.000 15.000 24.000 14.000 24.000
##  [71] 21.000 19.750 20.000 19.000 16.000 20.000 21.000 19.000 17.000 21.000
##  [81] 19.000 22.000 24.000 14.000 21.000 17.000 19.000 15.000 15.250 20.000
##  [91] 19.000 15.000 17.000 23.000 20.000 24.000 14.000 20.000 15.000 18.000
## [101] 22.000 21.000 15.000 19.000 18.500 16.000 22.000 18.000 24.000 22.000
## [111] 17.000 18.000 17.000 19.000 16.000 16.000 21.000 20.000 22.000 24.000
## [121] 16.000 17.000 17.000 18.000 20.000 21.000 21.000 15.200 15.000 24.500
## [131] 16.000 21.000 14.000 22.000 19.000 14.000 15.000 20.000 20.000 24.000
## [141] 22.000 19.000 21.000 20.000 15.000 23.000 21.000 16.000 22.000 15.000
## [151] 29.000 27.000 27.000 26.000 22.000 25.000 26.000 25.000 26.000 25.000
## [161] 29.000 24.000 27.000 24.000 29.000 24.000 29.000 24.250 27.000 24.000
## [171] 24.000 28.000 25.000 28.000 23.000 22.500 25.000 22.000 28.000 22.000
## [181] 24.000 26.100 23.000 23.000 26.000 25.000 27.000 25.000 26.000 29.000
## [191] 26.390 25.000 28.000 29.000 26.000 26.000 22.000 29.000 23.000 22.000
## [201] 28.750 23.000 22.000 22.000 29.000 22.000 24.250 27.000

As can be seen, the outliers have been dealt with through capping.

We now perform the same analysis for the variable Age :

Outliers2<-boxplot(Complete$Age)

Outliers2$out

## [1] 63 62 67 63 65 67 66 65

The variable has 8 outliers. However, we do not deal with these outliers as they aren’t indicators of extreme values that threaten to skew the data and they aren’t a result of an error (Experimental, human or otherwise).

We now perform the same analysis for the variable Days Employed

Outliers3<-boxplot(Complete$`Days Employed`)

Outliers3$out

## [1] 3611 3674

This variable has 2 outliers which could indicate extreme values that would skew the data. Therefore, we use the following function to cap said values.

cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    return(x)
}

CappedDays<-cap(Complete$`Days Employed`)
CappedDays

##   [1] 1973 2309  311  668 1165  908 1797  581 1518 1928 2032 1881 2686 3244
##  [15]   98 2336 2428   57 2148 1518 3065    2  170 2106  194 2428 1602 2064
##  [29] 1560 1238  439 1476 1336 1154 1294 1602 1602 2246  508 1602  440 1056
##  [43]  508 1973 1484 1238 1294 1784 2064  794 1112 1336  908    8  922  930
##  [57] 2106 2148  972 1602 2386  972    8 1179 1154 1476 2330 1973 1560 1055
##  [71] 2190 2022 1436 1294 1931 1560 1847 1518 1420 1635 1908 1518 1347  105
##  [85] 1378 2512 1966  693  164   83 1014 1468    2 1420 1476 2371 1784 1318
##  [99]  972 1700  921 1150 1294 1700 1602  762 1518 1378 1070  718 1623 2771
## [113]  514 2428   19 1420  267  194 1705 2190 1842 1954 1602 1420  497  510
## [127]  571 1154 1518 1560 1476 2106  777 2512 1378 1056 2022 1435   62  124
## [141] 1742 1378 1271 1596  875 1114  125 1575 1265 2583 1154  447  493 2428
## [155] 1180 1154 2092 2428  602  392 2211 1675 1238 1238 1378  770  280  399
## [169] 1294 1518 1476  770  236  538  905  770 2336 1238  448  378 1602   69
## [183] 1560  564   45  972 1116 2211 1658 2148 1658 1112  462   72  264 2064
## [197]   26  218 2645 1990 1154  419 1162  539 1196 1140 1154 1238 2295

As can be see above, the outliers have now been dealt with.

Transform

When we plot a histogram of the variable Pay Rate, we see that it’s right-skewed.

hist(Complete$`Pay Rate`)

To reduce this skewness we proceed to transform the variable using a Reciprocal Transformation as follows:

InvPay<-1/Complete$`Pay Rate`
hist(InvPay)

As can be seen above, the transformed data is now drastically more symmetrical than before.