The following packages were utilized in the the proceeding project.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
Data preprocessing is a vital step in any sort of data analysis, due to the quality of the analysis being highly dependent on the quality of the data it utilizes. For this purpose, the Human Resources Dataset will be preprocessed in the following project. In order to achieve a dataset that is suitable for analysis, a number of preprocessing tasks will be undertaken. First, the data will be inspected so that it’s dimensions, attributes and structure may be understood. Consequently, any necessary data type conversions will be performed. This step will include the conversion of character strings in to ordered factors where necessary, and the conversion of character strings that contain date related information into date type data. Following these preliminary steps, the data will be checked to determine whether or not it adhere’s to Hadley Whickam’s Tidy Data Principles, it will then be modified as necessary.
Further, a mutation will be performed on the Days Employed variable, to create a Months Employed variable. The dataset will then be further checked for missing values, inconsistencies and errors and if found, these issues will be dealt with accordingly. Following this, a similar course will be taken with regard to outliers that may be present in the numerical variables in the data frame. Finally, a transformation will be performed in order to coerce the Pay Rate variable into a greater degree of normality for statistical analysis. At the conclusion of the aforementioned processes, the data will be ready for analysis.
The following report is based on a merging of two datasets within the Human Resources Dataset. The datasets were retrieved from Kaggle and can be found at https://www.kaggle.com/rhuebner/human-resources-data-set/downloads/human-resources-data-set.zip/4#HRDataset_v9.csv. The “HRDataset_v9” and “production_staff” tables were created through a semi join on the Employee Name variable.
Further to the merging of the two datasets, a subset has been created to exclude the follow variables: MarriedID, MaritalStatusID, GenderID, EmpStatus_ID, DeptID and Perf_ScoreID. This decision was made due to the fact that all of the variables in question contained ID numbers that did not contain information that could not be determined elsewhere in the joined dataset.
Therefore the original join contained 209 observations of 28 variables. The resulting subset then contained 209 observations of 22 variables detailing various details of the employees.
The subset of data used in this project includes both personal details of employees along with details regarding their employment. The full set of variables for the subset before pre-processing are listed as follows:
In this section of the report the “HRDataset_v9” and “production_staff” datasets have been read into R. Following their reading, the first five observations of both data sets have been shown. Then, the two datasets have been merged using a semi-join, so as to only preserve the observations that are a match between the two original datasets. The new, joined dataset was then given the alias “Complete”.
setwd("C:/Users/Jaimee-Lee/Documents/R/Working Directory/Data Preprocessing/Data Sets/Assignment Three")
HRDataset<- read_csv("HRDataset_v9.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Employee Number` = col_double(),
## MarriedID = col_double(),
## MaritalStatusID = col_double(),
## GenderID = col_double(),
## EmpStatus_ID = col_double(),
## DeptID = col_double(),
## Perf_ScoreID = col_double(),
## Age = col_double(),
## `Pay Rate` = col_double(),
## Zip = col_double(),
## `Days Employed` = col_double()
## )
## See spec(...) for full column specifications.
productionstaff<- read_csv("production_staff.csv")
## Parsed with column specification:
## cols(
## `Employee Name` = col_character(),
## `Race Desc` = col_character(),
## `Date of Hire` = col_character(),
## TermDate = col_character(),
## `Reason for Term` = col_character(),
## `Employment Status` = col_character(),
## Department = col_character(),
## Position = col_character(),
## Pay = col_character(),
## `Manager Name` = col_character(),
## `Performance Score` = col_character(),
## `Abutments/Hour Wk 1` = col_double(),
## `Abutments/Hour Wk 2` = col_double(),
## `Daily Error Rate` = col_double(),
## `90-day Complaints` = col_double()
## )
head(HRDataset)
head(productionstaff)
Complete<- semi_join(HRDataset,productionstaff,by="Employee Name")
head(Complete)
As a last step, the dataset was subsetted using select() to exclude the identification variables (MarriedID, MaritalStatusID, GenderID, EmpStatus_ID, DeptID and Perf_ScoreID) as outlined above. The resulting subset was then also named “Complete”, so as to overwrite the join originally including the identification variables.
Complete <- Complete %>% select(., -c(MarriedID,MaritalStatusID,GenderID,EmpStatus_ID,DeptID,Perf_ScoreID))
head(Complete)
In the section below the subset Complete has been inspected using dim() and str(). The code shows that the dimensions of Complete are 209 observations by 22 variables. An outline of the structure of the data frame and the attributes is given using str(), which also includes information about the different data types of the variables.
dim(Complete)
## [1] 209 22
str(Complete)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 209 obs. of 22 variables:
## $ Employee Name : chr "King, Janet" "Albert, Michael" "Bozzi, Charles" "Butler, Webster L" ...
## $ Employee Number : num 1.00e+09 1.50e+09 1.30e+09 1.11e+09 1.41e+09 ...
## $ Age : num 63 49 48 34 44 36 45 41 36 36 ...
## $ Pay Rate : num 80 54.5 50.5 55 51 54 48.5 42 55 53 ...
## $ State : chr "MA" "MA" "MA" "MA" ...
## $ Zip : num 1902 2169 1901 2169 1731 ...
## $ DOB : chr "21/09/1954" "10/10/1968" "3/10/1970" "8/09/1983" ...
## $ Sex : chr "Female" "Male" "Male" "Male" ...
## $ MaritalDesc : chr "Married" "Divorced" "Single" "Single" ...
## $ CitizenDesc : chr "US Citizen" "US Citizen" "US Citizen" "US Citizen" ...
## $ Hispanic/Latino : chr "Yes" "No" "No" "No" ...
## $ RaceDesc : chr "White" "White" "Asian" "White" ...
## $ Date of Hire : chr "7/02/2012" "8/01/2011" "30/09/2013" "28/01/2016" ...
## $ Days Employed : num 1973 2309 311 668 1165 ...
## $ Date of Termination: chr NA NA "8/07/2014" NA ...
## $ Reason For Term : chr "N/A - still employed" "N/A - still employed" "retiring" "N/A - still employed" ...
## $ Employment Status : chr "Active" "Active" "Voluntarily Terminated" "Active" ...
## $ Department : chr "Executive Office" "Production" "Production" "Production" ...
## $ Position : chr "President & CEO" "Production Manager" "Production Manager" "Production Manager" ...
## $ Manager Name : chr "Board of Directors" "Janet King" "Janet King" "Janet King" ...
## $ Employee Source : chr "Pay Per Click - Google" "Employee Referral" "Billboard" "Pay Per Click - Google" ...
## $ Performance Score : chr "Fully Meets" "Fully Meets" "Fully Meets" "Exceeds" ...
## - attr(*, "spec")=
## .. cols(
## .. `Employee Name` = col_character(),
## .. `Employee Number` = col_double(),
## .. MarriedID = col_double(),
## .. MaritalStatusID = col_double(),
## .. GenderID = col_double(),
## .. EmpStatus_ID = col_double(),
## .. DeptID = col_double(),
## .. Perf_ScoreID = col_double(),
## .. Age = col_double(),
## .. `Pay Rate` = col_double(),
## .. State = col_character(),
## .. Zip = col_double(),
## .. DOB = col_character(),
## .. Sex = col_character(),
## .. MaritalDesc = col_character(),
## .. CitizenDesc = col_character(),
## .. `Hispanic/Latino` = col_character(),
## .. RaceDesc = col_character(),
## .. `Date of Hire` = col_character(),
## .. `Days Employed` = col_double(),
## .. `Date of Termination` = col_character(),
## .. `Reason For Term` = col_character(),
## .. `Employment Status` = col_character(),
## .. Department = col_character(),
## .. Position = col_character(),
## .. `Manager Name` = col_character(),
## .. `Employee Source` = col_character(),
## .. `Performance Score` = col_character()
## .. )
This inspection reveals that the variables Sex and Performance Score have been read into R as characters. It would be more ideal in terms of analysis, if these variables were factors. Therefore, the next piece of code illustrates the creation of factor levels for Sex (Male, Female) and Performance Score (N/A - too early to review, PIP, Needs Improvement, 90 - day meets, Fully Meets, Exceeds, Exceptional).
Complete$Sex<- factor(Complete$Sex, levels=c("Male","Female"),labels=c("Male","Female"))
Complete$`Performance Score`<- factor(Complete$`Performance Score`, levels=c("N/A- too early to review","PIP","Needs Improvement", "90-day meets", "Fully Meets", "Exceeds", "Exceptional" ),labels=c("N/A- too early to review","PIP","Needs Improvement", "90-day meets", "Fully Meets", "Exceeds", "Exceptional"), ordered = TRUE)
levels(Complete$Sex)
## [1] "Male" "Female"
levels(Complete$`Performance Score`)
## [1] "N/A- too early to review" "PIP"
## [3] "Needs Improvement" "90-day meets"
## [5] "Fully Meets" "Exceeds"
## [7] "Exceptional"
The inspection also revealed that the Date of Termination and Date of Hire variables were also read in as character strings. Therefore, these variables were converted into the date format to create ease of analysis.
Complete$`Date of Termination`<-dmy(Complete$`Date of Termination`)
head(Complete$`Date of Termination`)
## [1] NA NA "2014-07-08" NA NA
## [6] NA
Complete$`Date of Hire`<-dmy(Complete$`Date of Hire`)
head(Complete$`Date of Hire`)
## [1] "2012-02-07" "2011-01-08" "2013-09-30" "2016-01-28" "2014-09-18"
## [6] "2015-02-06"
The Complete data frame adheres to Hadley Wickham’s Tidy Data Principles. Each variable has its own column. The only column that could be considered to contain more than one variable would be that of Employee Name. Presently, it contains both the surname and given name of all employees, separated by a comma. Considering the context of the dataset, the decision was made to not separate the given names and surnames because it could be considered to be a singular string observation. The only reason a client may want these separated would be if they wanted to run some sort of analysis regarding family groups.
As can be seen in the data frame, each observation has its own row. Further to this, each value also has its own cell.
Considering that the data frame was in a relatively tidy format to begin with, no manipulation was undertaken.
The code below leads to the creation of a new variable i.e., MonthsEmployed, which divides the Days Employed values by twelve and therefore contains the number of months that each employee has been employed for.
mutate(Complete, MonthsEmployed = Complete$`Days Employed`/12)
The following code scans all the variables for missing values and returns their count serially.
colSums(is.na(Complete))
## Employee Name Employee Number Age
## 0 0 0
## Pay Rate State Zip
## 0 0 0
## DOB Sex MaritalDesc
## 0 0 0
## CitizenDesc Hispanic/Latino RaceDesc
## 0 0 0
## Date of Hire Days Employed Date of Termination
## 0 0 126
## Reason For Term Employment Status Department
## 0 0 0
## Position Manager Name Employee Source
## 0 0 0
## Performance Score
## 0
As can be seen above, the only variable that contains missing values is the Date of Termination. No action should be taken here because the data just hasn’t been entered by whoever has compiled the dataset and it doesn’t affect our analysis.
In this section, we consider only three numeric variables (Pay Rate, Age and Days Employed) and scan them for outliers. We disregard Employee Number in this scan as it is a variable to be used simply as a unique identifier and therefore the existence of its outliers is meaningless.
To visualize the outliers for the respective variables, we use boxplots.
For Pay Rate:
Outliers<-boxplot(Complete$`Pay Rate`)
Outliers$out
## [1] 80.0 54.5 50.5 55.0 51.0 54.0 48.5 42.0 55.0 53.0 38.5 52.0 53.0 55.0
The output shows us that there are 14 outliers for this variable. However, we now proceed to filter out the value ‘80’ as that value represents the salary of the CIO, who will not be taken into consideration for most analyses, as he/she will inevitably skew the values.
The new filtered boxplot is obtained as follows:
NonCIOData<-filter(Complete,Complete$`Pay Rate`<80)
boxplot(NonCIOData$`Pay Rate`)
We now proceed to deal with the outliers by capping them. To do so, we use the following function:
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
return(x)
}
CappedPay<-cap(NonCIOData$`Pay Rate`)
CappedPay
## [1] 46.225 46.225 46.225 46.225 46.225 46.225 46.225 46.225 46.225 46.225
## [11] 46.225 46.225 46.225 33.500 20.000 21.000 16.000 23.000 22.000 22.000
## [21] 17.000 22.000 16.760 18.000 22.000 17.000 16.000 22.000 16.000 20.000
## [31] 20.000 20.000 16.000 19.500 22.000 21.000 22.000 24.750 15.000 15.000
## [41] 15.750 19.750 15.000 23.500 20.000 21.000 22.000 16.750 17.000 21.250
## [51] 21.000 17.000 18.000 20.000 18.000 23.000 24.000 19.000 18.000 24.000
## [61] 19.000 22.000 15.000 16.000 17.000 16.000 15.000 24.000 14.000 24.000
## [71] 21.000 19.750 20.000 19.000 16.000 20.000 21.000 19.000 17.000 21.000
## [81] 19.000 22.000 24.000 14.000 21.000 17.000 19.000 15.000 15.250 20.000
## [91] 19.000 15.000 17.000 23.000 20.000 24.000 14.000 20.000 15.000 18.000
## [101] 22.000 21.000 15.000 19.000 18.500 16.000 22.000 18.000 24.000 22.000
## [111] 17.000 18.000 17.000 19.000 16.000 16.000 21.000 20.000 22.000 24.000
## [121] 16.000 17.000 17.000 18.000 20.000 21.000 21.000 15.200 15.000 24.500
## [131] 16.000 21.000 14.000 22.000 19.000 14.000 15.000 20.000 20.000 24.000
## [141] 22.000 19.000 21.000 20.000 15.000 23.000 21.000 16.000 22.000 15.000
## [151] 29.000 27.000 27.000 26.000 22.000 25.000 26.000 25.000 26.000 25.000
## [161] 29.000 24.000 27.000 24.000 29.000 24.000 29.000 24.250 27.000 24.000
## [171] 24.000 28.000 25.000 28.000 23.000 22.500 25.000 22.000 28.000 22.000
## [181] 24.000 26.100 23.000 23.000 26.000 25.000 27.000 25.000 26.000 29.000
## [191] 26.390 25.000 28.000 29.000 26.000 26.000 22.000 29.000 23.000 22.000
## [201] 28.750 23.000 22.000 22.000 29.000 22.000 24.250 27.000
As can be seen, the outliers have been dealt with through capping.
We now perform the same analysis for the variable Age :
Outliers2<-boxplot(Complete$Age)
Outliers2$out
## [1] 63 62 67 63 65 67 66 65
The variable has 8 outliers. However, we do not deal with these outliers as they aren’t indicators of extreme values that threaten to skew the data and they aren’t a result of an error (Experimental, human or otherwise).
We now perform the same analysis for the variable Days Employed
Outliers3<-boxplot(Complete$`Days Employed`)
Outliers3$out
## [1] 3611 3674
This variable has 2 outliers which could indicate extreme values that would skew the data. Therefore, we use the following function to cap said values.
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
return(x)
}
CappedDays<-cap(Complete$`Days Employed`)
CappedDays
## [1] 1973 2309 311 668 1165 908 1797 581 1518 1928 2032 1881 2686 3244
## [15] 98 2336 2428 57 2148 1518 3065 2 170 2106 194 2428 1602 2064
## [29] 1560 1238 439 1476 1336 1154 1294 1602 1602 2246 508 1602 440 1056
## [43] 508 1973 1484 1238 1294 1784 2064 794 1112 1336 908 8 922 930
## [57] 2106 2148 972 1602 2386 972 8 1179 1154 1476 2330 1973 1560 1055
## [71] 2190 2022 1436 1294 1931 1560 1847 1518 1420 1635 1908 1518 1347 105
## [85] 1378 2512 1966 693 164 83 1014 1468 2 1420 1476 2371 1784 1318
## [99] 972 1700 921 1150 1294 1700 1602 762 1518 1378 1070 718 1623 2771
## [113] 514 2428 19 1420 267 194 1705 2190 1842 1954 1602 1420 497 510
## [127] 571 1154 1518 1560 1476 2106 777 2512 1378 1056 2022 1435 62 124
## [141] 1742 1378 1271 1596 875 1114 125 1575 1265 2583 1154 447 493 2428
## [155] 1180 1154 2092 2428 602 392 2211 1675 1238 1238 1378 770 280 399
## [169] 1294 1518 1476 770 236 538 905 770 2336 1238 448 378 1602 69
## [183] 1560 564 45 972 1116 2211 1658 2148 1658 1112 462 72 264 2064
## [197] 26 218 2645 1990 1154 419 1162 539 1196 1140 1154 1238 2295
As can be see above, the outliers have now been dealt with.
When we plot a histogram of the variable Pay Rate, we see that it’s right-skewed.
hist(Complete$`Pay Rate`)
To reduce this skewness we proceed to transform the variable using a Reciprocal Transformation as follows:
InvPay<-1/Complete$`Pay Rate`
hist(InvPay)
As can be seen above, the transformed data is now drastically more symmetrical than before.