The required packages and libraries are loaded for the preprocessing.
library(readr)
library(dplyr)
library(lubridate)
library(tidyverse)
library(outliers)
library(forecast)
library(zoo)
library(scales)
library(ggplot2)
library(Hmisc)
The dataset used here is the HR Analytics case study, originally retrieved from Kaggle (https://www.kaggle.com/vjchoudhary7/hr-analytics-case-study#general_data.csv). There are six files in total: five CSV files and one XLSX file, which is a data dictionary.
For the purpose of preprocessing we make use of three of the datasets: general_data, employee_survey_data and manager_survey_data. Each of the three had its own inconsistencies, and all of them were handled. The three datasets were merged in two stages. Before merging, the categorical variables were labelled and ordered, which eases the rest of the preprocessing. All three datasets share a common key variable, EmployeeID. The data did not need any reshaping, since the rules of a tidy dataset were already met. A new column, PerHourRate, was mutated to derive an employee's hourly pay from the monthly income.
The merged dataset was then checked for missing values, which were imputed with the mode (the mean or median would be equally reasonable for the numeric columns). Box plots were drawn to spot outliers, which were handled with z-score analysis and a capping technique. Finally, the numerical variables were transformed with various transformation techniques, and the best one was chosen based on the skewness reduction achieved.
Dataset source: https://www.kaggle.com/vjchoudhary7/hr-analytics-case-study#general_data.csv
Below is the variable list for general_data:
Age, Attrition, BusinessTravel, Department, DistanceFromHome, Education (1-"Below College", 2-"College", 3-"Bachelor", 4-"Master", 5-"Doctor"), EducationField, EmployeeCount, EmployeeID, Gender, JobLevel, JobRole, MaritalStatus, MonthlyIncome, NumCompaniesWorked, Over18, PercentSalaryHike, StandardHours, StockOptionLevel, TotalWorkingYears, TrainingTimesLastYear, YearsAtCompany, YearsSinceLastPromotion, YearsWithCurrManager
Below is the variable list for employee_survey_data:
EmployeeID - employee number/ID; EnvironmentSatisfaction - work environment satisfaction level (1-"Low", 2-"Medium", 3-"High", 4-"Very High"); JobSatisfaction - job satisfaction level (1-"Low", 2-"Medium", 3-"High", 4-"Very High"); WorkLifeBalance - work-life balance level (1-"Bad", 2-"Good", 3-"Better", 4-"Best")
Below is the variable list for manager_survey_data:
EmployeeID; JobInvolvement (1-"Low", 2-"Medium", 3-"High", 4-"Very High"); PerformanceRating (1-"Low", 2-"Good", 3-"Excellent", 4-"Outstanding")
general_data <- read_csv("/Users/vicky/Downloads/hr-analytics-case-study/general_data.csv")
Parsed with column specification:
cols(
.default = col_double(),
Attrition = col_character(),
BusinessTravel = col_character(),
Department = col_character(),
EducationField = col_character(),
Gender = col_character(),
JobRole = col_character(),
MaritalStatus = col_character(),
Over18 = col_character()
)
See spec(...) for full column specifications.
employee_survey_data <- read_csv("/Users/vicky/Downloads/hr-analytics-case-study/employee_survey_data.csv")
Parsed with column specification:
cols(
EmployeeID = col_double(),
EnvironmentSatisfaction = col_double(),
JobSatisfaction = col_double(),
WorkLifeBalance = col_double()
)
manager_survey_data <- read_csv("/Users/vicky/Downloads/hr-analytics-case-study/manager_survey_data.csv")
Parsed with column specification:
cols(
EmployeeID = col_double(),
JobInvolvement = col_double(),
PerformanceRating = col_double()
)
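The "Parsed with column specification" messages are simply readr reporting the column types it guessed. If desired, the types can be declared up front, which silences the message and makes the import reproducible; below is a minimal sketch for the survey file (same path as above, not part of the original run):
employee_survey_data <- read_csv(
  "/Users/vicky/Downloads/hr-analytics-case-study/employee_survey_data.csv",
  col_types = cols(
    EmployeeID              = col_double(),
    EnvironmentSatisfaction = col_double(),
    JobSatisfaction         = col_double(),
    WorkLifeBalance         = col_double()
  )
)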
head(general_data)
head(employee_survey_data)
head(manager_survey_data)
str(general_data)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 4410 obs. of 24 variables:
$ Age : num 51 31 32 38 32 46 28 29 31 25 ...
$ Attrition : chr "No" "Yes" "No" "No" ...
$ BusinessTravel : chr "Travel_Rarely" "Travel_Frequently" "Travel_Frequently" "Non-Travel" ...
$ Department : chr "Sales" "Research & Development" "Research & Development" "Research & Development" ...
$ DistanceFromHome : num 6 10 17 2 10 8 11 18 1 7 ...
$ Education : num 2 1 4 5 1 3 2 3 3 4 ...
$ EducationField : chr "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
$ EmployeeCount : num 1 1 1 1 1 1 1 1 1 1 ...
$ EmployeeID : num 1 2 3 4 5 6 7 8 9 10 ...
$ Gender : chr "Female" "Female" "Male" "Male" ...
$ JobLevel : num 1 1 4 3 1 4 2 2 3 4 ...
$ JobRole : chr "Healthcare Representative" "Research Scientist" "Sales Executive" "Human Resources" ...
$ MaritalStatus : chr "Married" "Single" "Married" "Married" ...
$ MonthlyIncome : num 131160 41890 193280 83210 23420 ...
$ NumCompaniesWorked : num 1 0 1 3 4 3 2 2 0 1 ...
$ Over18 : chr "Y" "Y" "Y" "Y" ...
$ PercentSalaryHike : num 11 23 15 11 12 13 20 22 21 13 ...
$ StandardHours : num 8 8 8 8 8 8 8 8 8 8 ...
$ StockOptionLevel : num 0 1 3 3 2 0 1 3 0 1 ...
$ TotalWorkingYears : num 1 6 5 13 9 28 5 10 10 6 ...
$ TrainingTimesLastYear : num 6 3 2 5 2 5 2 2 2 2 ...
$ YearsAtCompany : num 1 5 5 8 6 7 0 0 9 6 ...
$ YearsSinceLastPromotion: num 0 1 0 7 0 7 0 0 7 1 ...
$ YearsWithCurrManager : num 0 4 3 5 4 7 0 0 8 5 ...
- attr(*, "spec")=
.. cols(
.. Age = col_double(),
.. Attrition = col_character(),
.. BusinessTravel = col_character(),
.. Department = col_character(),
.. DistanceFromHome = col_double(),
.. Education = col_double(),
.. EducationField = col_character(),
.. EmployeeCount = col_double(),
.. EmployeeID = col_double(),
.. Gender = col_character(),
.. JobLevel = col_double(),
.. JobRole = col_character(),
.. MaritalStatus = col_character(),
.. MonthlyIncome = col_double(),
.. NumCompaniesWorked = col_double(),
.. Over18 = col_character(),
.. PercentSalaryHike = col_double(),
.. StandardHours = col_double(),
.. StockOptionLevel = col_double(),
.. TotalWorkingYears = col_double(),
.. TrainingTimesLastYear = col_double(),
.. YearsAtCompany = col_double(),
.. YearsSinceLastPromotion = col_double(),
.. YearsWithCurrManager = col_double()
.. )
str(employee_survey_data)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 4410 obs. of 4 variables:
$ EmployeeID : num 1 2 3 4 5 6 7 8 9 10 ...
$ EnvironmentSatisfaction: num 3 3 2 4 4 3 1 1 2 2 ...
$ JobSatisfaction : num 4 2 2 4 1 2 3 2 4 1 ...
$ WorkLifeBalance : num 2 4 1 3 3 2 1 3 3 3 ...
- attr(*, "spec")=
.. cols(
.. EmployeeID = col_double(),
.. EnvironmentSatisfaction = col_double(),
.. JobSatisfaction = col_double(),
.. WorkLifeBalance = col_double()
.. )
str(manager_survey_data)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 4410 obs. of 3 variables:
$ EmployeeID : num 1 2 3 4 5 6 7 8 9 10 ...
$ JobInvolvement : num 3 2 3 2 3 3 3 3 3 3 ...
$ PerformanceRating: num 3 4 3 3 3 3 4 4 4 3 ...
- attr(*, "spec")=
.. cols(
.. EmployeeID = col_double(),
.. JobInvolvement = col_double(),
.. PerformanceRating = col_double()
.. )
Here the three datasets are separately labelled and ordered. After that they are merged by the common key variable EmployeeID. A new data frame is created by subsetting the most significant variables for further analysis.
general_data$Education <- factor(general_data$Education,levels = c('1','2','3','4','5'),
labels = c('Below College','College','Bachelor','Master','Doctor'),ordered = TRUE)
employee_survey_data$JobSatisfaction <- factor(employee_survey_data$JobSatisfaction,
levels = c('1','2','3','4'),
labels = c('Low','Medium','High','Very High'),
ordered = TRUE)
employee_survey_data$WorkLifeBalance<- factor(employee_survey_data$WorkLifeBalance,
levels = c('1','2','3','4'),
labels = c('Bad','Good','Better','Best'),
ordered = TRUE)
employee_survey_data$EnvironmentSatisfaction<- factor(employee_survey_data$EnvironmentSatisfaction,
levels = c('1','2','3','4'),
labels = c('Low','Medium','High','Very high'),
ordered = TRUE)
manager_survey_data$JobInvolvement <- factor(manager_survey_data$JobInvolvement,
levels = c('1','2','3','4'),
labels = c('Low','Medium','High','Very High'),
ordered = TRUE)
manager_survey_data$PerformanceRating <- factor(manager_survey_data$PerformanceRating,
levels = c('1','2','3','4'),
labels = c('Low','Good','Excellent','Outstanding'),
ordered = TRUE)
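As a quick, optional sanity check (not part of the original output), the recoded factors can be inspected to confirm the labels and their ordering:
# Confirm the ordered levels produced by the factor() calls above
levels(employee_survey_data$JobSatisfaction)    # "Low" < "Medium" < "High" < "Very High"
levels(manager_survey_data$PerformanceRating)   # "Low" < "Good" < "Excellent" < "Outstanding"
is.ordered(general_data$Education)              # TRUE if the ordering was applied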
e1 <- merge(employee_survey_data,manager_survey_data,by="EmployeeID")
e2 <- merge(e1,general_data,by="EmployeeID")
e1e2 <- e2[c(1,3,5,6,7,8,9,10,12,15,17,18,19,20,22,23,25,27)]
dim(e1e2)
[1] 4410 18
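Selecting columns by position works, but it silently breaks if the column order of e2 ever changes. A hedged equivalent using dplyr::select with column names (the names are exactly those shown in the glimpse() output below):
# Equivalent subset by name rather than by position (same 18 columns)
e1e2 <- e2 %>% dplyr::select(EmployeeID, JobSatisfaction, JobInvolvement,
                             PerformanceRating, Age, Attrition, BusinessTravel,
                             Department, Education, Gender, JobRole, MaritalStatus,
                             MonthlyIncome, NumCompaniesWorked, PercentSalaryHike,
                             StandardHours, TotalWorkingYears, YearsAtCompany)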
Each variable has its own column, each observation its own row, and each value its own cell. Hence the rules for a tidy dataset are met and there is no need to reshape.
glimpse(e1e2)
Observations: 4,410
Variables: 18
$ EmployeeID         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1…
$ JobSatisfaction    <ord> Very High, Medium, Medium, Very High, Low, Medium, High, Medium,…
$ JobInvolvement     <ord> High, Medium, High, Medium, High, High, High, High, High, High, …
$ PerformanceRating  <ord> Excellent, Outstanding, Excellent, Excellent, Excellent, Excelle…
$ Age                <dbl> 51, 31, 32, 38, 32, 46, 28, 29, 31, 25, 45, 36, 55, 47, 28, 37, …
$ Attrition          <chr> "No", "Yes", "No", "No", "No", "No", "Yes", "No", "No", "No", "N…
$ BusinessTravel     <chr> "Travel_Rarely", "Travel_Frequently", "Travel_Frequently", "Non-…
$ Department         <chr> "Sales", "Research & Development", "Research & Development", "Re…
$ Education          <ord> College, Below College, Master, Doctor, Below College, Bachelor,…
$ Gender             <chr> "Female", "Female", "Male", "Male", "Male", "Female", "Male", "M…
$ JobRole            <chr> "Healthcare Representative", "Research Scientist", "Sales Execut…
$ MaritalStatus      <chr> "Married", "Single", "Married", "Married", "Single", "Married", …
$ MonthlyIncome      <dbl> 131160, 41890, 193280, 83210, 23420, 40710, 58130, 31430, 20440,…
$ NumCompaniesWorked <dbl> 1, 0, 1, 3, 4, 3, 2, 2, 0, 1, 0, 0, 0, 1, 1, 4, 1, 2, 7, 1, 1, 3…
$ PercentSalaryHike  <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 12, 17, 11, 14, 11, …
$ StandardHours      <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8…
$ TotalWorkingYears  <dbl> 1, 6, 5, 13, 9, 28, 5, 10, 10, 6, 21, 16, 37, 10, 5, 7, 3, 15, 1…
$ YearsAtCompany     <dbl> 1, 5, 5, 8, 6, 7, 0, 0, 9, 6, 20, 15, 36, 10, 5, 5, 3, 5, 7, 8, …
head(e1e2)
tail(e1e2)
Here we scan for missing values. First we count the missing values in each column of the merged dataset with colSums(is.na()). Then we check the percentage of missing values. Since the percentage is small, we can either omit the affected rows or impute them with the mean, median or mode. Here we decide to impute rather than omit. First we record the row indices of the missing values, then we define a mode function and replace all of them with the mode, as the affected variables are ordinal or discrete counts.
colSums(is.na(e1e2))
EmployeeID JobSatisfaction JobInvolvement PerformanceRating
0 20 0 0
Age Attrition BusinessTravel Department
0 0 0 0
Education Gender JobRole MaritalStatus
0 0 0 0
MonthlyIncome NumCompaniesWorked PercentSalaryHike StandardHours
0 19 0 0
TotalWorkingYears YearsAtCompany
9 0
percent(colSums(is.na(e1e2))[which(colSums(is.na(e1e2))>0)] / length(unique(e1e2$EmployeeID)))
[1] "0.454%" "0.431%" "0.204%"
na_NumOfComp <- which(is.na(e1e2$NumCompaniesWorked))
na_JobSat <- which(is.na(e1e2$JobSatisfaction))
na_TotWork <- which(is.na(e1e2$TotalWorkingYears))
# Statistical mode: the most frequent value in v
mode <- function(v){
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
e1e2$NumCompaniesWorked[na_NumOfComp] <- mode(e1e2$NumCompaniesWorked)
e1e2$JobSatisfaction[na_JobSat] <- mode(e1e2$JobSatisfaction)
e1e2$TotalWorkingYears[na_TotWork] <- mode(e1e2$TotalWorkingYears)
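Since the text above also mentions the mean and median as options, here is a hedged sketch of median imputation for the two numeric columns, had we preferred it (shown commented out because the mode has already been filled in above; JobSatisfaction would stay with the mode since it is an ordered factor):
# e1e2$NumCompaniesWorked[na_NumOfComp] <- median(e1e2$NumCompaniesWorked, na.rm = TRUE)
# e1e2$TotalWorkingYears[na_TotWork]    <- median(e1e2$TotalWorkingYears, na.rm = TRUE)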
str(e1e2)
'data.frame': 4410 obs. of 18 variables:
$ EmployeeID : num 1 2 3 4 5 6 7 8 9 10 ...
$ JobSatisfaction : Ord.factor w/ 4 levels "Low"<"Medium"<..: 4 2 2 4 1 2 3 2 4 1 ...
$ JobInvolvement : Ord.factor w/ 4 levels "Low"<"Medium"<..: 3 2 3 2 3 3 3 3 3 3 ...
$ PerformanceRating : Ord.factor w/ 4 levels "Low"<"Good"<"Excellent"<..: 3 4 3 3 3 3 4 4 4 3 ...
$ Age : num 51 31 32 38 32 46 28 29 31 25 ...
$ Attrition : chr "No" "Yes" "No" "No" ...
$ BusinessTravel : chr "Travel_Rarely" "Travel_Frequently" "Travel_Frequently" "Non-Travel" ...
$ Department : chr "Sales" "Research & Development" "Research & Development" "Research & Development" ...
$ Education : Ord.factor w/ 5 levels "Below College"<..: 2 1 4 5 1 3 2 3 3 4 ...
$ Gender : chr "Female" "Female" "Male" "Male" ...
$ JobRole : chr "Healthcare Representative" "Research Scientist" "Sales Executive" "Human Resources" ...
$ MaritalStatus : chr "Married" "Single" "Married" "Married" ...
$ MonthlyIncome : num 131160 41890 193280 83210 23420 ...
$ NumCompaniesWorked: num 1 0 1 3 4 3 2 2 0 1 ...
$ PercentSalaryHike : num 11 23 15 11 12 13 20 22 21 13 ...
$ StandardHours : num 8 8 8 8 8 8 8 8 8 8 ...
$ TotalWorkingYears : num 1 6 5 13 9 28 5 10 10 6 ...
$ YearsAtCompany : num 1 5 5 8 6 7 0 0 9 6 ...
colSums(is.na(e1e2))
EmployeeID JobSatisfaction JobInvolvement PerformanceRating
0 0 0 0
Age Attrition BusinessTravel Department
0 0 0 0
Education Gender JobRole MaritalStatus
0 0 0 0
MonthlyIncome NumCompaniesWorked PercentSalaryHike StandardHours
0 0 0 0
TotalWorkingYears YearsAtCompany
0 0
Here we mutate a new variable, PerHourRate, the hourly pay of an employee. The calculation is simple: the monthly income is divided by 4.33 to get a weekly salary, divided by 5 to get a daily salary, and finally divided by the standard working hours per day. Then we check for NA values, if any.
e1e2 <- mutate(e1e2, PerHourRate = (MonthlyIncome/4.33)/5/StandardHours)
head(e1e2)
dim(e1e2)
[1] 4410 19
#colSums(is.na(e1e2))
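As a quick check of the arithmetic, take the first employee, whose MonthlyIncome is 131160 and StandardHours is 8 (see the glimpse() output above):
# 131160 / 4.33 ≈ 30291  (approximate weekly pay)
# 30291  / 5    ≈ 6058   (approximate daily pay)
# 6058   / 8    ≈ 757    (PerHourRate for EmployeeID 1)
(131160 / 4.33) / 5 / 8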
Here we check for outliers using the outliers package. A new data frame is created by selecting the relevant numeric variables with dplyr, and a box plot for each variable is drawn separately. For Age there are no outliers, and for NumCompaniesWorked there is only one possible outlier, which looks acceptable. The outliers are flagged by computing z-scores, and a user-defined function is created to cap them.
e1e2sb <- e1e2 %>% dplyr::select(Age,TotalWorkingYears,YearsAtCompany,NumCompaniesWorked)
boxplot(e1e2sb$TotalWorkingYears,main="Outlier Check for Total Working Years",
xlab="Total Working Years")
#boxplot(e1e2sb$Age,main="Outlier Check for Age",xlab="Age")
boxplot(e1e2sb$YearsAtCompany,main="Outlier Check for Years At Company",xlab="Years")
boxplot(e1e2sb$NumCompaniesWorked,main="Outlier Check for Num Of Companies Worked",xlab="Numbers")
zs1 <- e1e2sb$TotalWorkingYears %>% scores(type = "z")
which(abs(zs1)>3)
[1] 13 144 188 338 367 637 699 786 859 927 1044 1144 1298 1311 1339 1400 1483 1614
[19] 1658 1808 1837 2107 2169 2256 2329 2397 2514 2614 2768 2781 2809 2870 2953 3084 3128 3278
[37] 3307 3577 3639 3726 3799 3867 3984 4084 4238 4251 4279 4340
length(which(abs(zs1)>3))
[1] 48
zs2 <- e1e2sb$YearsAtCompany %>% scores(type = "z")
which(abs(zs2)>3)
[1] 13 144 188 338 367 637 699 786 859 927 1044 1144 1298 1311 1339 1400 1483 1614
[19] 1658 1808 1837 2107 2169 2256 2329 2397 2514 2614 2768 2781 2809 2870 2953 3084 3128 3278
[37] 3307 3577 3639 3726 3799 3867 3984 4084 4238 4251 4279 4340
length(which(abs(zs2)>3))
[1] 48
z3 <- e1e2sb$NumCompaniesWorked %>% scores(type = "z")
which(abs(z3)>3)
integer(0)
length(which(abs(z3)>3))
[1] 0
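# cap() below pulls extreme values back towards the bulk of the data:
# anything below Q1 - 1.5*IQR is set to the 5th percentile and anything
# above Q3 + 1.5*IQR is set to the 95th percentile.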
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
e1e2sb$TotalWorkingYears <- e1e2sb$TotalWorkingYears %>% cap()
e1e2sb$YearsAtCompany <- e1e2sb$YearsAtCompany %>% cap()
e1e2sb$NumCompaniesWorked <- e1e2sb$NumCompaniesWorked %>% cap()
boxplot(e1e2sb,main="Box Plot After Handling of Outliers")
#boxplot(e1e2sb$TotalWorkingYears,main="Outlier Check for Total Working Years",xlab="Total Working Years")
#boxplot(e1e2sb$Age,main="Outlier Check for Age",xlab="Age")
#boxplot(e1e2sb$YearsAtCompany,main="Outlier Check for Years At Company",xlab="Years")
#boxplot(e1e2sb$NumCompaniesWorked,main="Outlier Check for Num Of Companies Worked",xlab="Numbers")
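The cap() above uses IQR fences with the 5th/95th percentiles as replacement values. Because the text also mentions z-score analysis, here is a hedged sketch of a purely z-score-based cap as an alternative (not what was run above):
# Alternative (not run): cap observations with |z| > k at mean +/- k standard deviations
cap_z <- function(x, k = 3){
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  x[x > m + k*s] <- m + k*s
  x[x < m - k*s] <- m - k*s
  x
}
# e1e2sb$TotalWorkingYears <- cap_z(e1e2sb$TotalWorkingYears)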
Here we apply different transformation techniques to reduce the skewness and move the distributions closer to normal. For Age, the transformations below show that the log transformation works better than the Box-Cox transformation. We then reduce the skewness of YearsAtCompany by applying log and cube-root transformations; the skewness is progressively reduced, and the cube-root transformation works well once the histogram breaks are specified.
hist(e1e2$Age,main = "Histogram for Age")
boxage <- BoxCox(e1e2$Age,lambda = "auto")
hist(boxage,main = "Box Cox Transformation for Age")
logage <- log10(e1e2$Age)
hist(logage,main = "Log Transformation for Age")
hist(e1e2$YearsAtCompany,main = "Histogram for Years At Company")
logyatc <- log10(e1e2$YearsAtCompany)
hist(logyatc,main = "Log Transformation for Years At Company")
cbrt_yac <- (e1e2$YearsAtCompany)^(1/3)
hist(cbrt_yac,main="Cube Root Transformation for Years At Company")
hist(cbrt_yac,main="Cube Root Transformation for Years At Company",breaks = 12)
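Because the best transformation is chosen on the basis of skewness reduction, the comparison can be made explicit with a small moment-based skewness helper (a sketch; none of the packages loaded above is assumed to provide a skewness function, so one is defined by hand):
# Sample skewness via the third standardised moment; values closer to 0
# indicate a more symmetric distribution after the transformation.
skew <- function(x){
  x <- x[is.finite(x)]                      # drop the -Inf produced by log10(0)
  mean((x - mean(x))^3) / sd(x)^3
}
skew(e1e2$YearsAtCompany)                   # raw
skew(log10(e1e2$YearsAtCompany))            # log transformation
skew((e1e2$YearsAtCompany)^(1/3))           # cube-root transformation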