The following coursebook is produced by the team at Algoritma for BRI Data Hackathon 2021 Workshop: People Analytics. The coursebook is intended for a restricted audience only, i.e. the individuals having received this coursebook directly from the training organization. It may not be reproduced, distributed, translated or adapted in any form outside these individuals and organizations without permission.
Algoritma is a data science education center based in Jakarta. We organize workshops and training programs to help working professionals and students gain mastery in various data science sub-fields: data visualization, machine learning, data modeling, statistical inference, etc.
People analytics, especially in the Human Resource Management area, is revolutionising the way human resources departments operate, leading to higher efficiency and better results overall [1]. Human resources has been using analytics for years. However, the collection, processing and analysis of data has been largely manual, and given the nature of human resources dynamics and HR KPIs, this approach has been constraining HR. It is therefore surprising that HR departments woke up to the utility of machine learning so late in the game. In this workshop, we're going to do predictive analytics and also interpret the resulting ML model.
The data is taken from Kaggle. It is a synthetic dataset inspired by this data, which has often been used to teach people analytics to Human Resource Management students in college. Note that the data comes from a fictitious company, but the variables and the cases are identical to those in real companies. We're going to identify which employees are most likely to leave the company so HR will have enough time to take precautionary measures. Our analysis will be their first signal, so the company needs your help to do this analysis with accurate and interpretable methods.
This workshop covers most of the data science workflow to explore, predict, and interpret classification cases in the human resources area. We'll focus on interpreting the business application rather than the mathematics behind every model.
Load these packages in R. If you don't have them installed, run the chunk below first.
<- c("tidyverse","lubridate","ggplot2","plotly","FactoMineR","factoextra",
packages "tidytext","rsample","caret","tidymodels","ranger","lime")
install.packages(packages)
library(tidyverse) # Data Wrangling
library(lubridate) # Date type data
library(ggplot2) # Data Visualization
library(plotly) # Interactive data visualization
library(FactoMineR) # PCA
library(factoextra) # PCA Visualization
library(tidytext) # additional data wrangling
library(rsample) # data sampling
library(caret) # ML modeling
library(tidymodels) # ML modeling
library(ranger) # Random Forest
library(xgboost) # XGBoost
library(lime) # ML model interpretation
Our data is stored in .txt format. R provides many ways to import various data formats such as csv, xlsx, or json, and there are also packages that allow importing from various types of databases [2]. In our case, the values in the txt file are separated by tabs. It is sometimes a good idea to open the file manually to see how it is structured, so we can import it correctly. I use the read.table() function (though it is not the only function that reads txt files) with tab as the separator, header = TRUE to tell the function that the data already has column names, and fill = TRUE so that incomplete rows are padded instead of causing an error.
<- read.table("data_input/HR DATA.txt",sep = "\t",header = T,fill = T)
hrd head(hrd)
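As an aside, the same file could also be imported with readr (loaded as part of tidyverse). This is just an alternative sketch; the rest of the notebook keeps the data frame produced by read.table() above.
# not run: tidyverse alternative for tab-separated files
# hrd_tbl <- readr::read_tsv("data_input/HR DATA.txt")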
Here is the data dictionary I collected from the original sources:

Employee_Name : Employee's full name
EmpID : Employee ID, unique to each employee
MarriedID : Is the person married (1 or 0 for yes or no)
MaritalStatusID : Marital status code that matches the text field MaritalDesc
EmpStatusID : Employment status code that matches the text field EmploymentStatus
DeptID : Department ID code that matches the department the employee works in
PerfScoreID : Performance score code that matches the employee's most recent performance score
FromDiversityJobFairID : Was the employee sourced from the diversity job fair? 1 or 0 for yes or no
PayRate : The person's hourly pay rate. All salaries are converted to hourly pay rate
Termd : Has this employee been terminated - 1 or 0
PositionID : An integer indicating the person's position
Position : The text name/title of the position the person has
State : The state that the person lives in
Zip : The zip code for the employee
DOB : Date of birth of the employee
Sex : Sex - M or F
MaritalDesc : The marital status of the person (divorced, single, widowed, separated, etc.)
CitizenDesc : Label for whether the person is a Citizen or Eligible NonCitizen
HispanicLatino : Yes or No field for whether the employee is Hispanic/Latino
RaceDesc : Description/text of the race the person identifies with
DateofHire : Date the person was hired
DateofTermination : Date the person was terminated; only populated if Termd = 1
TermReason : A text reason/description for why the person was terminated
EmploymentStatus : A description/category of the person's employment status. Anyone currently working full time = Active
Department : Name of the department that the person works in
ManagerName : The name of the person's immediate manager
ManagerID : A unique identifier for each manager
RecruitmentSource : The name of the recruitment source where the employee was recruited from
PerformanceScore : Performance score text/category (Fully Meets, Partially Meets, PIP, Exceeds)
EngagementSurvey : Results from the last engagement survey, managed by our external partner
EmpSatisfaction : A basic satisfaction score between 1 and 5, as reported on a recent employee satisfaction survey
SpecialProjectsCount : The number of special projects that the employee worked on during the last 6 months
LastPerformanceReview_Date : The most recent date of the person's last performance review
DaysLateLast30 : The number of times that the employee was late to work during the last 30 days
Original.DS : Whether the observation comes from the original dataset or is synthetic

The data is undoubtedly messy. It has redundant information, wrong data types, and possibly missing values or outliers. The most common first steps in data wrangling are to fix data types and remove redundant variables. For example, Employee_Name and EmpID carry the same information. I prefer to use Employee_Name to identify the rows, so we can drop the EmpID column. Before that, we need to make sure every value in Employee_Name is unique.
hrd[duplicated(hrd$Employee_Name),]

hrd %>%
  filter(Employee_Name %in% hrd[duplicated(hrd$Employee_Name),"Employee_Name"])
From the chunk above we know that there are a few duplicated employee names. Let's append "2" to the second occurrence of each duplicated name to make it unique.
hrd[duplicated(hrd$Employee_Name),]$Employee_Name <- paste(hrd[duplicated(hrd$Employee_Name),]$Employee_Name, "2", sep = "")

hrd[duplicated(hrd$Employee_Name),]
Now there’s no duplicated name. We also know that there are redundant variables like MarriedID
and MaritalDesc
or DeptID
and Department
. For me is better to keep the description column so we need to remove the id one.
hrd <- hrd %>%
  select(-c(EmpID, Zip, MarriedID, MaritalStatusID, GenderID, EmpStatusID,
            DeptID, PerfScoreID, ManagerID, Original.DS, PositionID, DaysLateLast30))
head(hrd)
Next, let’s change the data type. notice that almost every character variable is actually factor and some numeric like FromDiversityJobFairID
and Termd
is also a factor. Date type like DOB
and DateofHire
is listed as character. We need to change it all
hrd <- hrd %>%
  mutate(DOB = dmy(DOB),
DateofHire = dmy(DateofHire),
DateofTermination = dmy(DateofTermination),
LastPerformanceReview_Date = dmy(LastPerformanceReview_Date),
FromDiversityJobFairID = as.factor(FromDiversityJobFairID),
Termd = as.factor(Termd)) %>%
mutate_if(is.character, as.factor) %>%
mutate(Employee_Name = as.character(Employee_Name))
head(hrd)
Everything looks fine except DOB. That column holds employees' dates of birth, so it is impossible for someone to be born after 2021. The dmy() function assumes that year "64" means 2064 instead of 1964. Sadly there is no convenient option here to change that, so we need to fix it the hard way.
That is one of the challenges of the data cleaning process: sometimes we need to get our hands dirty with odd code or workarounds. In our case, I will first split the date into its components (year, month, day) by the separator, then prepend "19" to the year (for example, 64 -> 1964), then recombine everything and convert it back to a Date.
# DOB in our data has become Date type, we can't change it to character directly.
# We need to re-read the data and take that column only
dob <- read.table("data_input/HR DATA.txt", sep = "\t", header = T, fill = T)
dob <- strsplit(dob$DOB, split = "-")

dob <- data.frame(matrix(unlist(dob), nrow = length(dob), byrow = T)) %>%
  setNames(c("day","month","year")) %>%
  mutate(year_new = paste("19", year, sep = ""))
head(dob)
Combine the date components and convert the result back to a Date:
hrd <- hrd %>%
  mutate(DOB = paste(dob$day, dob$month, dob$year_new, sep = "-"),
         DOB = dmy(DOB))
head(hrd)
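As an aside, a shorter fix would have been to parse the column as-is and then shift any impossible future year back by one century (a sketch only, assuming every two-digit year in DOB belongs to the 1900s; the column-splitting approach above is what the rest of the notebook uses).
# not run: alternative century fix with lubridate period arithmetic
# hrd <- hrd %>%
#   mutate(DOB = if_else(year(DOB) > 2021, DOB - years(100), DOB))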
The data types are now in the correct format. Next, let's see whether our data has missing values.
colSums(is.na(hrd))
## Employee_Name FromDiversityJobFairID
## 0 0
## PayRate Termd
## 0 0
## Position State
## 0 0
## DOB Sex
## 0 0
## MaritalDesc CitizenDesc
## 0 0
## HispanicLatino RaceDesc
## 0 0
## DateofHire DateofTermination
## 0 2351
## TermReason EmploymentStatus
## 0 0
## Department ManagerName
## 0 0
## RecruitmentSource PerformanceScore
## 0 0
## EngagementSurvey EmpSatisfaction
## 78 78
## SpecialProjectsCount LastPerformanceReview_Date
## 78 3103
There are many ways to deal with missing values: we can either remove them or impute them with a value that makes sense for the data. That's why it is important to 'know' the data before doing any analysis. We know our data is about employee resignation. It makes sense that DateofTermination has many missing values, because those observations may still be working at the company. The LastPerformanceReview_Date column shows when the employee's performance was last reviewed; maybe some employees have never been reviewed. Still, we need to fill those NAs. Let's fill the NAs in both columns with the date the data was taken. Why? Let's keep going; I promise it will make sense later.
<- as.Date("2020-01-01")
recent
<- hrd %>%
hrd mutate(DateofTermination = replace_na(hrd$DateofTermination,recent),
LastPerformanceReview_Date = replace_na(hrd$LastPerformanceReview_Date,recent))
head(hrd)
It’s clean now, but we still have some missing value
colSums(is.na(hrd))
## Employee_Name FromDiversityJobFairID
## 0 0
## PayRate Termd
## 0 0
## Position State
## 0 0
## DOB Sex
## 0 0
## MaritalDesc CitizenDesc
## 0 0
## HispanicLatino RaceDesc
## 0 0
## DateofHire DateofTermination
## 0 0
## TermReason EmploymentStatus
## 0 0
## Department ManagerName
## 0 0
## RecruitmentSource PerformanceScore
## 0 0
## EngagementSurvey EmpSatisfaction
## 78 78
## SpecialProjectsCount LastPerformanceReview_Date
## 78 0
After some examination, I found that employees with a missing value in EngagementSurvey are also missing EmpSatisfaction and SpecialProjectsCount, and vice versa.
hrd %>%
  filter(is.na(EngagementSurvey))
In addition, there’re only 78 rows from 3310 are missing. I think its safe for us to remove the corresponding row instead of fill random numbers and wasted our time.
hrd_clean <- hrd %>%
  filter(!is.na(EngagementSurvey))
colSums(is.na(hrd_clean))
## Employee_Name FromDiversityJobFairID
## 0 0
## PayRate Termd
## 0 0
## Position State
## 0 0
## DOB Sex
## 0 0
## MaritalDesc CitizenDesc
## 0 0
## HispanicLatino RaceDesc
## 0 0
## DateofHire DateofTermination
## 0 0
## TermReason EmploymentStatus
## 0 0
## Department ManagerName
## 0 0
## RecruitmentSource PerformanceScore
## 0 0
## EngagementSurvey EmpSatisfaction
## 0 0
## SpecialProjectsCount LastPerformanceReview_Date
## 0 0
Now our data is clean, though we may still need more cleaning or feature engineering later; the data cleaning process is never really done. In my view, we can do some simple feature engineering that may improve our prediction model. A raw date is often useless if we feed it directly to a model, but we can turn it into something more meaningful. For example, we can create a new column showing the age at which the employee was hired by taking the difference between DOB and DateofHire. We can also compute how many days have passed since the employee was last reviewed by subtracting LastPerformanceReview_Date from the recent date.
hrd_clean <- hrd_clean %>%
  mutate(age = as.numeric(round((DateofHire - DOB)/365)),
DayAfterReview = as.numeric(round(recent - LastPerformanceReview_Date)),
WorkingDay = ifelse(Termd == 0,
as.numeric(round(recent - DateofHire)),
as.numeric(round(DateofTermination - DateofHire))))
head(hrd_clean)
We can use the new columns to sharpen our analysis. Remember when I said we would fill the NAs in LastPerformanceReview_Date with the recent date? When we subtract the dates to get DayAfterReview, those rows become "0", so we can assume that employees with a "0" in DayAfterReview have never been reviewed. Notice that I also make a distinction when creating the WorkingDay column, which measures how long the employee has been working for the company: for those who are still active, I subtract the hire date from the recent date, and for those who retired or resigned, I subtract it from DateofTermination.
You may have ideas for more feature engineering. Don't be afraid to try them; they may improve our analysis and prediction. Feel free to tell me if you have one!
Exploratory Data Analysis (EDA) is an important step that is often forgotten. It isn't wise to throw all variables into the model as predictors, relying on a single statistical test such as correlation to justify that a variable is good for prediction [3]. We do EDA to learn more about our data and to form first hypotheses for our prediction objective. There is a lot we can do in the EDA step, and data visualization is one of the most important parts [4]. FYI, there are even jobs specialized in this step, namely "Data Analyst" or "Business Intelligence". Because there is so much we could do, I'll pose some interesting questions and we'll try to answer them by wrangling the data.
Interesting question:
We can easily find outliers by drawing boxplots. Statistically, an outlier is an extreme value that lies more than 1.5 × IQR beyond the quartiles. A numeric version of this rule is sketched below; after that, let's draw a boxplot for every numeric variable.
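A minimal numeric version of that rule, applied to the age column as an illustration (this sketch assumes hrd_clean as prepared above):
q <- quantile(hrd_clean$age, c(0.25, 0.75))
iqr <- q[2] - q[1]
# count values that fall beyond the boxplot whiskers
sum(hrd_clean$age < q[1] - 1.5 * iqr | hrd_clean$age > q[2] + 1.5 * iqr)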
hrd_clean %>%
  select_if(is.numeric) %>%
pivot_longer(cols = c(names(.))) %>%
ggplot(aes(y = value)) +
geom_boxplot(aes(fill = name),show.legend = F) +
scale_fill_brewer(palette = "RdGy") +
facet_wrap(~name,scales = "free_y") +
theme_minimal()
From the plot above we know that age, DayAfterReview, SpecialProjectsCount, and WorkingDay have some outliers. We can also find this using a biplot. A biplot helps us visualize many numeric variables in two dimensions, where each axis is a PC. A PC, or Principal Component, is a summary of all the scaled numeric variables, and each variable contributes differently to each PC. In short, PCA helps us reduce dimensionality by summarising the numeric variables.
Let’s create the PCA
factor <- which(sapply(hrd_clean, is.factor))

pca <- PCA(hrd_clean %>% `rownames<-`(hrd_clean$Employee_Name),
           quali.sup = c(1,7,13,14,24,26,factor), scale.unit = T, graph = F)
summary(pca)
##
## Call:
## PCA(X = hrd_clean %>% `rownames<-`(hrd_clean$Employee_Name),
## scale.unit = T, quali.sup = c(1, 7, 13, 14, 24, 26, factor),
## graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 1.500 1.144 1.007 0.934 0.921 0.494
## % of var. 25.006 19.066 16.783 15.568 15.350 8.226
## Cumulative % of var. 25.006 44.072 60.856 76.424 91.774 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2
## Gonzalez, Maria | 2.001 | 1.530 0.048 0.585 | -0.279 0.002 0.019 |
## Cockel, James | 2.318 | -1.321 0.036 0.325 | 1.157 0.036 0.249 |
## Bunbury, Jessica | 2.737 | 0.729 0.011 0.071 | -0.331 0.003 0.015 |
## Buck, Edward | 2.116 | 0.551 0.006 0.068 | 0.469 0.006 0.049 |
## Jacobi, Hannah | 2.444 | -0.947 0.019 0.150 | 0.239 0.002 0.010 |
## Riordan, Michael | 3.745 | 0.346 0.002 0.009 | 0.376 0.004 0.010 |
## Ferguson, Susan | 3.199 | -0.436 0.004 0.019 | -2.586 0.181 0.653 |
## Albert, Michael | 2.365 | 0.659 0.009 0.078 | -0.770 0.016 0.106 |
## Gentry, Mildred | 1.850 | -0.815 0.014 0.194 | -0.483 0.006 0.068 |
## Onque, Jasmine | 2.448 | 0.752 0.012 0.094 | 0.007 0.000 0.000 |
## Dim.3 ctr cos2
## Gonzalez, Maria 0.325 0.003 0.026 |
## Cockel, James -1.309 0.053 0.319 |
## Bunbury, Jessica -1.634 0.082 0.357 |
## Buck, Edward -0.688 0.015 0.106 |
## Jacobi, Hannah -1.521 0.071 0.387 |
## Riordan, Michael 2.765 0.235 0.545 |
## Ferguson, Susan -1.460 0.065 0.208 |
## Albert, Michael 1.095 0.037 0.215 |
## Gentry, Mildred 0.525 0.008 0.081 |
## Onque, Jasmine 1.438 0.064 0.345 |
##
## Variables
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## PayRate | 0.841 47.177 0.708 | 0.145 1.825 0.021 | 0.089
## EngagementSurvey | 0.043 0.125 0.002 | -0.495 21.377 0.245 | 0.585
## EmpSatisfaction | -0.131 1.144 0.017 | 0.569 28.278 0.323 | -0.354
## SpecialProjectsCount | 0.859 49.175 0.738 | 0.113 1.107 0.013 | -0.010
## age | 0.079 0.413 0.006 | -0.589 30.335 0.347 | -0.323
## WorkingDay | -0.172 1.966 0.029 | 0.442 17.078 0.195 | 0.653
## ctr cos2
## PayRate 0.785 0.008 |
## EngagementSurvey 33.985 0.342 |
## EmpSatisfaction 12.475 0.126 |
## SpecialProjectsCount 0.011 0.000 |
## age 10.388 0.105 |
## WorkingDay 42.356 0.427 |
##
## Supplementary categories (the 10 first)
## Dist Dim.1 cos2 v.test Dim.2 cos2 v.test
## Abbott, Harley | 2.345 | -1.194 0.259 -0.974 | 0.566 0.058 0.529 |
## Abbott, Kendal | 2.313 | -0.718 0.096 -0.586 | -1.410 0.372 -1.318 |
## Abbott, Marvin | 2.177 | -1.266 0.338 -1.034 | 1.719 0.623 1.607 |
## Acevedo, Ciara | 1.828 | 1.260 0.475 1.028 | 0.277 0.023 0.259 |
## Acosta, Jaylen | 1.315 | -1.041 0.627 -0.850 | -0.206 0.024 -0.192 |
## Acosta, Michael | 2.497 | -0.947 0.144 -0.773 | -0.711 0.081 -0.664 |
## Adams, Gabriel | 2.337 | -0.726 0.096 -0.592 | -1.388 0.353 -1.298 |
## Adams, Karma | 0.944 | -0.823 0.760 -0.672 | 0.379 0.161 0.354 |
## Adinolfi, Wilson K | 2.138 | -1.244 0.339 -1.016 | 1.656 0.600 1.548 |
## Adkins, Jonathan | 2.231 | 0.562 0.063 0.459 | 0.148 0.004 0.139 |
## Dim.3 cos2 v.test
## Abbott, Harley -0.917 0.153 -0.914 |
## Abbott, Kendal 1.065 0.212 1.061 |
## Abbott, Marvin -0.239 0.012 -0.238 |
## Acevedo, Ciara 0.058 0.001 0.057 |
## Acosta, Jaylen 0.550 0.175 0.548 |
## Acosta, Michael -0.705 0.080 -0.703 |
## Adams, Gabriel 1.100 0.221 1.096 |
## Adams, Karma 0.112 0.014 0.112 |
## Adinolfi, Wilson K -0.030 0.000 -0.030 |
## Adkins, Jonathan -0.223 0.010 -0.222 |
Using only 3 PCs, we already cover about 60% of the variance of the actual data. Let's draw some biplots.
options(ggrepel.max.overlaps = Inf)
plot.PCA(pca,choix = "ind",invisible = "quali",select = "contrib10",habillage = 4)
Even though it is a bit messy, we can find 10 employee names that are considered outliers. But remember, this plot only represents roughly 44% of the variance of the actual data. On which variables are those employees outliers? Let's see how the variables contribute to each PC.
plot.PCA(pca,choix = "var")
PC1, or Dimension 1, is contributed mostly by PayRate and SpecialProjectsCount; Dim 2 is contributed by the other variables. From both plots, observations placed far to the right have high PayRate and SpecialProjectsCount. Observations placed near the bottom have high EngagementSurvey and age but low EmpSatisfaction and WorkingDay. The 10 outliers probably have high values in SpecialProjectsCount, age, and WorkingDay. This is consistent with what we found using the boxplots earlier.
fviz_pca_biplot(pca,habillage = 4,invisible = "quali",label = "var",)
There is a lot of information we can take from the biplot above, but take it with a grain of salt because it only summarises the numeric variables and covers about 44% of the variance. Some insights we can take:

- PC1 is contributed mostly by PayRate and SpecialProjectsCount; PC2 by EmpSatisfaction, WorkingDay, age, and EngagementSurvey
- PayRate and SpecialProjectsCount have a high positive correlation
- EngagementSurvey and EmpSatisfaction have a high negative correlation
- PayRate and SpecialProjectsCount contribute mainly to PC1, while age and EmpSatisfaction contribute more to PC2 than WorkingDay and EngagementSurvey
- PayRate and SpecialProjectsCount may affect employee attrition

Employees' managers and their performance scores are recorded in the ManagerName and PerformanceScore columns. We can't calculate a correlation between categorical variables, but we can test whether they are dependent on each other using the chi-square test of independence, available through the chisq.test() function.
chisq.test(hrd_clean$ManagerName,hrd_clean$PerformanceScore)
## Warning in chisq.test(hrd_clean$ManagerName, hrd_clean$PerformanceScore): Chi-
## squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: hrd_clean$ManagerName and hrd_clean$PerformanceScore
## X-squared = 826.79, df = 60, p-value < 0.00000000000000022
To interpret the output, we first need to state the null and alternative hypotheses:
H0: there is no relationship between manager and employee performance score
H1: there is a relationship between manager and employee performance score
From the output we see that the p-value is far smaller than alpha (we use the most common alpha, 0.05), so we reject the null hypothesis (H0) and conclude that there is a significant relationship between manager and employee performance score; the two columns are associated with each other.
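If you prefer to check the decision rule in code, the p-value can be compared with alpha directly (a small sketch repeating the same test as above):
chi <- chisq.test(hrd_clean$ManagerName, hrd_clean$PerformanceScore)
chi$p.value < 0.05   # TRUE -> reject H0: the two variables are associated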
We have answered the question; should we stop here? No. EDA is where data visualization lives in the data science workflow, so draw your best visualizations here to show people interesting things about the data.
# I found that some columns have empty factor levels; we need to drop them
hrd_clean <- hrd_clean %>%
  droplevels()
table(hrd_clean$ManagerName,hrd_clean$PerformanceScore) %>%
as.data.frame() %>%
mutate(Var1 = reorder_within(Var1,by = Freq,within = Var2)) %>%
ggplot(aes(x = Freq, y = Var1)) +
geom_col(aes(fill = Var2)) +
scale_fill_brewer(palette = "RdGy") +
facet_wrap(~Var2,scales = "free") +
scale_y_reordered() +
labs(title = "Employee Performance by Manager",
subtitle = "Who have the Best Employee?",
x = "Frequency", y = "Manager Name") +
theme_minimal() +
theme(legend.position = "none")
The plot above shows the number of employees with each performance score, broken down by manager. "Exceeds" is the best performance level and "PIP" is the lowest. It looks like Brannon Miller is the best manager because he has the highest number of top-performing employees and, in contrast, John Smith is the worst.
But is this plot enough to tell what is happening between employees, their managers, and their performance scores? I don't think so. Maybe Brannon Miller simply has so many employees that, counted by frequency, he looks like the best.
# showing the number of rows per manager
hrd_clean %>%
  group_by(ManagerName) %>%
count() %>%
arrange(-n)
It is now clear that Brannon Miller has many employees under his name compared to most managers. Let's wrangle the data to see the percentage of top-performing employees out of the total number of employees for each manager.
manager_perf <- table(hrd_clean$ManagerName, hrd_clean$PerformanceScore) %>%
  as.data.frame() %>%
pivot_wider(names_from = "Var2",values_from = "Freq") %>%
mutate(perc_exceed = round( Exceeds / (Exceeds + `Fully Meets` + `Needs Improvement` + PIP),3),
perc_pip = round(PIP/ (Exceeds + `Fully Meets` + `Needs Improvement` + PIP),3))
manager_perf
The table above is the result of our aggregation. Now we have, for each manager, the performance counts of their employees and the corresponding percentages. Let's plot it.
manager_perf %>%
  select(c(Var1, Exceeds, perc_exceed)) %>%
pivot_longer(cols = c(Exceeds,perc_exceed)) %>%
ggplot(aes(x = value, y = reorder(Var1,value))) +
geom_col(aes(fill = name),show.legend = F) +
scale_fill_manual(values = c("#bf0808","#3b3b3b")) +
facet_wrap(~name,scales = "free_x") +
labs(title = "Best Employee Performance by Manager",
subtitle = "Frequency and Percentage",
x = "Value", y = "Manager Name") +
theme_minimal()
Now it is fair to say that Brannon Miller is the best manager, not only by the frequency of high-performing employees but also by percentage: more than 30% of his employees are high performers. We also see that Amy Dunn, second by frequency, is not actually second best; Alex Sweetwater beats her on the percentage of high-performing employees. With this kind of information we can draw a sound conclusion because the calculation is balanced. It is important to share the information that is most accurate, rather than the story we 'want'.
manager_perf %>%
  select(c(Var1, PIP, perc_pip)) %>%
pivot_longer(cols = c(PIP,perc_pip)) %>%
ggplot(aes(x = value, y = reorder(Var1,value))) +
geom_col(aes(fill = name),show.legend = F) +
scale_fill_manual(values = c("#bf0808","#3b3b3b")) +
facet_wrap(~name,scales = "free_x") +
labs(title = "Worst Employee Performance by Manager",
subtitle = "Frequency and Percentage",
x = "Value", y = "Manager Name") +
theme_minimal()
It is indeed true that John Smith has the worst-performing employees, both by frequency and by percentage. Kissy Sulivan, however, is second in both the good-performance and the bad-performance rankings. Based on these findings, the HR team can design actions to improve employee performance while also taking the manager's influence into account.
hrd_clean %>%
  group_by(Department) %>%
summarise(avg_PayRate = mean(PayRate),
median_PayRate = median(PayRate),
min_PayRate = min(PayRate),
max_PayRate = max(PayRate)) %>%
arrange(-median_PayRate)
From the aggregation result we see that the Production department has the lowest average and median PayRate compared to the other departments; its median is not even half of the four highest department medians. Every department has a different workload, and there are always other factors that affect salary, so we can't directly conclude that PayRate is unfair across departments. We can get better insight if we analyze PayRate within each department.
hrd_clean %>%
  filter(Termd == 0) %>%
select(PayRate,Department,WorkingDay) %>%
ggplot(aes(x = WorkingDay, y = PayRate)) +
geom_point(aes(col = WorkingDay),show.legend = F) +
scale_color_continuous(low = "#3b3b3b",high = "#bf0808") +
facet_wrap(~Department) +
labs(title = "PayRate and WorkingDay Correlation",
subtitle = "by each department") +
theme_minimal()
It is common knowledge that the longer an employee works at the same company, the higher the salary. But from the plot above, in every department there is no visible correlation between WorkingDay and PayRate.
GGally::ggcorr(hrd_clean %>%
                 select_if(is.numeric), label = T, legend.position = "none")
Every other numeric variable also has a very low correlation with PayRate, except SpecialProjectsCount. The correlation between SpecialProjectsCount and PayRate is about 0.5, meaning that as SpecialProjectsCount increases, PayRate tends to increase as well. So there is no clear link between PayRate and Department, and PayRate may instead be influenced by SpecialProjectsCount.
cor.test(hrd_clean$PayRate,hrd_clean$SpecialProjectsCount)
##
## Pearson's product-moment correlation
##
## data: hrd_clean$PayRate and hrd_clean$SpecialProjectsCount
## t = 31.591, df = 3230, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4590573 0.5117513
## sample estimates:
## cor
## 0.4858456
The cor.test() function not only calculates the correlation but also tests whether the variables are significantly correlated with each other. The p-value is below alpha (0.05), so, as with the chi-square test, we reject the null hypothesis and conclude that the variables are significantly correlated and worth analyzing further. Let's see which department has the most SpecialProjectsCount.
hrd_clean %>%
  group_by(Department) %>%
summarise(total = sum(SpecialProjectsCount),
avg = mean(SpecialProjectsCount)) %>%
arrange(-avg)
The IT/IS department has the most SpecialProjectsCount; employees in that department handle 5 special projects on average. We can assume that people in the IT/IS department are more likely to get a higher PayRate because they receive more special projects. Sales, however, has a low average number of special projects yet the highest PayRate median.
We’ve done a lot of EDA. Now it’s time to build prediction model. In this part we will build some model with different algorithm like Decision Tree, Random Forest, and XGBoost and pick one of the best. Even though we will be using lots of algorithm, the code itself is not complicated. Tidymodels
package will help us to build model with same workflow. We only need to change 1 or 2 function based on the model.
Our data has a lot of variables, 27 in total. Not all of them are useful for modeling. We can remove some by identifying which variables have low variance, or by selecting variables subjectively.
zero_var <- nearZeroVar(hrd_clean)
zero_var
## [1] 6 10 14 24 26
The nearZeroVar() function returns the indices of columns that have low (near-zero) variance. Low-variance columns are bad for our model because there is not much information the model can learn from them. Next we use the indices to subset our data.
hrd_mod <- hrd_clean[,-zero_var]
head(hrd_mod)
Our data still contains some redundant information. For example, TermReason describes why employees resigned, much like EmploymentStatus. We will remove that kind of variable, along with the date and ID variables.
hrd_mod <- hrd_mod %>%
  select(-c(Employee_Name, DOB, DateofHire, TermReason, EmploymentStatus, WorkingDay))
head(hrd_mod)
Before modeling, it is important to split the data into train and test sets: the train data is used to train the model and the test data to evaluate it. We also need to check whether the target variable is balanced.
prop.table(table(hrd_mod$Termd))
##
## 0 1
## 0.6704827 0.3295173
The data is not balanced. After we split it, we need to balance it using downsampling or upsampling.
set.seed(123)
splitter <- initial_split(hrd_mod, prop = 0.8, strata = "Termd")
train <- training(splitter)
test <- testing(splitter)
prop.table(table(train$Termd))
##
## 0 1
## 0.6702744 0.3297256
Upsampling duplicates the minority class until the data is balanced; downsampling, in contrast, removes rows from the majority class. Since the imbalance is not that extreme, I'll use downsampling (a sketch of the upsampling alternative appears a little further below for comparison).
train_down <- downSample(train %>% select(-Termd), train$Termd)

# downSample() renames Termd into Class;
# let's rename it back and check the class proportions
train_down <- train_down %>%
  rename("Termd" = "Class") %>%
  droplevels()
prop.table(table(train_down$Termd))
##
## 0 1
## 0.5 0.5
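As an aside, the upsampling route with caret would look like the sketch below (not run, and not used in the rest of the workflow); upSample() duplicates minority-class rows instead of discarding majority-class rows.
# not run: upsampling alternative
# train_up <- upSample(train %>% select(-Termd), train$Termd, yname = "Termd")
# prop.table(table(train_up$Termd))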
Now the train data is balanced and we're ready for modeling. Since most of our data is categorical, I'll focus on building tree-based models, and I'll use default parameters for now; parameter tuning deserves its own section.
library(partykit)
# mod_dt <- ctree(Termd ~ ., data = train_down)
# fitting takes a while every time, so I saved the trained model
# and we only need to load it
mod_dt <- readRDS("model/mod_dt.rds")
# predict to test data
pred_dt <- predict(mod_dt, test, type = "prob")
head(data.frame(pred_dt))
Notice that I set the prediction output to probabilities rather than classes. We can always change the threshold that decides which class a prediction becomes; for now, we will use the common default of 0.5.
pred_dt_class <- as.factor(ifelse(pred_dt[,1] > 0.5, 0, 1))
conf_dt <- confusionMatrix(pred_dt_class, test$Termd, positive = "0")
conf_dt
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 277 47
## 1 156 165
##
## Accuracy : 0.6853
## 95% CI : (0.6479, 0.721)
## No Information Rate : 0.6713
## P-Value [Acc > NIR] : 0.2388
##
## Kappa : 0.3695
##
## Mcnemar's Test P-Value : 0.00000000000003453
##
## Sensitivity : 0.6397
## Specificity : 0.7783
## Pos Pred Value : 0.8549
## Neg Pred Value : 0.5140
## Prevalence : 0.6713
## Detection Rate : 0.4295
## Detection Prevalence : 0.5023
## Balanced Accuracy : 0.7090
##
## 'Positive' Class : 0
##
The confusion matrix shows us a lot. The most common way to decide whether a model is good enough is to look at accuracy: we have 68.5% accuracy, but is that good enough, and is accuracy the only metric that matters in classification? To answer this, let's take a short break to study the confusion matrix.
The confusion matrix compares the classification results with the actual values. Several metrics can be calculated from it, each with its own interpretation and use depending on the case [5].
If 1 is the positive class, the confusion matrix consists of four parts:
- True Positive (TP): we predict 1 and the actual value is 1
- True Negative (TN): we predict 0 and the actual value is 0
- False Positive (FP): we predict 1 but the actual value is 0
- False Negative (FN): we predict 0 but the actual value is 1
Recall and precision are the most common metrics besides accuracy that take FP or FN into account. They weight FP or FN differently and have interesting interpretations depending on the case.
The image above shows some example cases. In our case, it is important to flag which employees are likely to resign as early as we can, so the HR department can take action to prevent it. The cost of an employee actually resigning is higher than the cost of a precautionary action. Thus, it is acceptable to act on an employee who is predicted to resign but would not have, rather than miss someone who will resign before we act. In confusion matrix terms, this simply means we need to avoid a high number of False Negatives (employees predicted not to resign who actually do). In summary, we want a high Recall value.
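To make recall and precision concrete, here is a quick hand calculation using the counts from the decision tree confusion matrix printed above (conf_dt, with "0" as the positive class):
TP <- 277; FN <- 156; FP <- 47   # counts read off the conf_dt output
TP / (TP + FN)   # recall / sensitivity, about 0.64
TP / (TP + FP)   # precision / pos pred value, about 0.85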
cmplot() for classification thresholds
One of Algoritma's mentors has created a package that draws confusion matrix graphs for optimizing probability thresholds. Install the package from GitHub, then run the confmat_plot() function:
devtools::install_github("ahmadhusain/cmplot")

cmplot::confmat_plot(prob = pred_dt[,1],
                     ref = test$Termd,
                     postarget = "0",
                     negtarget = "1")
Now we will focus on the recall metric. Our decision tree model only reaches about 64% recall (sensitivity), so in the next models we are looking for a higher recall.
# mod_rf <- rand_forest(mode = "classification") %>%
# set_engine("ranger") %>%
# fit(Termd ~., data = train_down)
<- readRDS("model/mod_rf.rds") mod_rf
pred_rf <- predict(mod_rf, test, type = "prob")
pred_rf_class <- as.factor(ifelse(pred_rf$.pred_0 > 0.5, 0, 1))

conf_rf <- confusionMatrix(pred_rf_class, test$Termd, positive = "0")
conf_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 370 19
## 1 63 193
##
## Accuracy : 0.8729
## 95% CI : (0.8447, 0.8976)
## No Information Rate : 0.6713
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.7264
##
## Mcnemar's Test P-Value : 0.000002049
##
## Sensitivity : 0.8545
## Specificity : 0.9104
## Pos Pred Value : 0.9512
## Neg Pred Value : 0.7539
## Prevalence : 0.6713
## Detection Rate : 0.5736
## Detection Prevalence : 0.6031
## Balanced Accuracy : 0.8824
##
## 'Positive' Class : 0
##
Random Forest recall is better than the decision tree: it has 85.4% recall and 87.3% accuracy. We can improve this later in the model tuning part. For now, let's try another algorithm.
# mod_xgb <- boost_tree(mode = "classification") %>%
# set_engine("xgboost") %>%
# fit(Termd~., data = train_down)
<- readRDS("model/mod_xgb.rds") mod_xgb
pred_xgb <- predict(mod_xgb, test, type = "prob")
pred_xgb_class <- as.factor(ifelse(pred_xgb$.pred_0 > 0.5, 0, 1))
conf_xgb <- confusionMatrix(pred_xgb_class, test$Termd, positive = "0")
conf_xgb
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 335 28
## 1 98 184
##
## Accuracy : 0.8047
## 95% CI : (0.7719, 0.8346)
## No Information Rate : 0.6713
## P-Value [Acc > NIR] : 0.00000000000003367
##
## Kappa : 0.5917
##
## Mcnemar's Test P-Value : 0.00000000078957868
##
## Sensitivity : 0.7737
## Specificity : 0.8679
## Pos Pred Value : 0.9229
## Neg Pred Value : 0.6525
## Prevalence : 0.6713
## Detection Rate : 0.5194
## Detection Prevalence : 0.5628
## Balanced Accuracy : 0.8208
##
## 'Positive' Class : 0
##
Even though XGBoost is often regarded as a more advanced tree ensemble, the result here is not better than random forest: XGBoost reaches 80.5% accuracy and 77.4% recall.
Let's draw a table to summarise our modeling results so far:
data.frame(accuracy = c(conf_dt$overall[1],conf_rf$overall[1],conf_xgb$overall[1]),
recall = c(conf_dt$byClass[1],conf_rf$byClass[1],conf_xgb$byClass[1])) %>%
`rownames<-`(c("Decision Tree","Random Forest","XGBoost")) %>%
arrange(-recall)
The model evaluation is already quite good, but we can always push the results further with hyperparameter tuning. In short, hyperparameter tuning means choosing an optimal set of hyperparameters for a learning algorithm: we provide a set of candidate values for each model parameter, fit the models, and keep the parameter values from the best fit. Since the Decision Tree results are far below Random Forest and XGBoost, we will not tune it and will focus only on the best two.
Set up k-fold cross-validation:
set.seed(123)
folds <- vfold_cv(train_down, 5, strata = "Termd")
We will tune the mtry and trees parameters using a tuning grid. This process takes a lot of time because it trains every possible parameter combination and extracts the best result (see the quick size check below). We will aim for the highest recall and ROC AUC as our best model.
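Before running the search, it helps to know its size (a quick sketch; the same grid is defined as rf.grid in the next chunk): 5 values of trees times 5 values of mtry gives 25 combinations, each fitted on 5 folds, i.e. 125 model fits.
nrow(expand.grid(trees = seq(500, 700, 50), mtry = 4:8)) * 5   # 25 combinations x 5 folds = 125 fits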
# trees and mtry grid combination
rf.grid <- expand.grid(trees = seq(500, 700, 50), mtry = 4:8)

# model setup: random forest using the ranger engine, where trees and mtry
# will be filled in from the grid
rf.setup <- rand_forest(trees = tune(), mtry = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

# workflow: model + recipe
rf.wf <- workflow() %>%
  add_model(rf.setup) %>%
  add_recipe(recipe(Termd ~ ., data = train_down, skip = T))
# fit the data to model workflow
# rf.tune <- tune_grid(rf.wf,resamples = folds,grid = rf.grid,
# metrics = metric_set(accuracy,recall,roc_auc))
<- readRDS("model/rf_tune.rds")
rf.tune show_best(rf.tune,metric = "recall")
A random forest with mtry = 8 and 500 trees is the best model based on recall; we will use it as our main random forest model.
rf_best <- rf.wf %>%
  finalize_workflow(select_best(rf.tune, "recall")) %>%
  fit(train_down)

rf_pred_2 <- predict(rf_best, test, type = "prob")
rf_pred_class2 <- as.factor(ifelse(rf_pred_2$.pred_0 > 0.5, 0, 1))

conf_rf_2 <- confusionMatrix(rf_pred_class2, test$Termd)
conf_rf_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 389 11
## 1 44 201
##
## Accuracy : 0.9147
## 95% CI : (0.8905, 0.9351)
## No Information Rate : 0.6713
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.8142
##
## Mcnemar's Test P-Value : 0.00001597
##
## Sensitivity : 0.8984
## Specificity : 0.9481
## Pos Pred Value : 0.9725
## Neg Pred Value : 0.8204
## Prevalence : 0.6713
## Detection Rate : 0.6031
## Detection Prevalence : 0.6202
## Balanced Accuracy : 0.9232
##
## 'Positive' Class : 0
##
Voila! Now we have 89.8% recall and 91.5% accuracy, better than the default Random Forest model we made earlier. Previously, I said that the confusion matrix is largely a game of choosing a threshold: if we lower FN, FP will rise. There is another metric that evaluates the model across every classification threshold, namely the Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve (AUC).
cbind(rf_pred_2,rf_pred_class2,test$Termd) %>%
setNames(c("prob_0","prob_1","pred_class","truth")) %>%
roc_curve(truth,prob_0) %>%
autoplot()
The plot above shows the ROC curve. The curve itself is not a single number; it portrays the trade-off between sensitivity and 1 - specificity as we change the threshold. A good ROC curve bends closer to the top-left corner.
cbind(rf_pred_2,rf_pred_class2,test$Termd) %>%
setNames(c("prob_0","prob_1","pred_class","truth")) %>%
roc_auc(truth, prob_0)
We have 0.98 AUC, that’s great! it means the ROC curve covers 98% area in the whole plot. The higher it is, the better. Next, lets do another model tuning for XGBoost
xg.grid <- expand.grid(trees = seq(500, 700, 50), mtry = 4:8, learn_rate = c(0.1, 0.01, 0.005))

xg.setup <- boost_tree(trees = tune(), mtry = tune(), learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xg.wf <- workflow() %>%
  add_formula(Termd ~ .) %>%
  add_model(xg.setup)
# xg.tune <- tune_grid(xg.wf,resamples = folds, grid = xg.grid,
# metrics = metric_set(accuracy,recall,roc_auc))
<- readRDS("model/xg_tune.rds")
xg.tune show_best(xg.tune,metric = "recall")
The best XGBoost model uses mtry = 8, 700 trees, and a learn_rate of 0.1; we will use it as our main XGBoost model.
xg_best <- xg.wf %>%
  finalize_workflow(select_best(xg.tune, "recall")) %>%
  fit(train_down)
## [14:17:29] WARNING: amalgamation/../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
xg_pred_2 <- predict(xg_best, test, type = "prob")
xg_pred_class2 <- as.factor(ifelse(xg_pred_2$.pred_0 > 0.5, 0, 1))

conf_xg_2 <- confusionMatrix(xg_pred_class2, test$Termd)
conf_xg_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 367 27
## 1 66 185
##
## Accuracy : 0.8558
## 95% CI : (0.8263, 0.882)
## No Information Rate : 0.6713
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.6879
##
## Mcnemar's Test P-Value : 0.00008134
##
## Sensitivity : 0.8476
## Specificity : 0.8726
## Pos Pred Value : 0.9315
## Neg Pred Value : 0.7371
## Prevalence : 0.6713
## Detection Rate : 0.5690
## Detection Prevalence : 0.6109
## Balanced Accuracy : 0.8601
##
## 'Positive' Class : 0
##
The recall is better than the previous XGBoost model but still lower than the tuned random forest. Let's draw the ROC curve and calculate the AUC.
cbind(xg_pred_2,xg_pred_class2,test$Termd) %>%
setNames(c("prob_0","prob_1","pred_class","truth")) %>%
roc_curve(truth,prob_0) %>%
autoplot()
The Random Forest ROC curve is slightly better than this one; its curve sits closer to the top-left corner.
cbind(xg_pred_2,xg_pred_class2,test$Termd) %>%
setNames(c("prob_0","prob_1","pred_class","truth")) %>%
roc_auc(truth, prob_0)
We have an AUC of 0.92. It is clear that, in our case, Random Forest performs better than XGBoost. The data science team can develop this model further so the HR department can use it for prediction.
Let’s summarise all the evaluation result
data.frame(accuracy = c(conf_dt$overall[1], conf_rf$overall[1], conf_xgb$overall[1],
                        conf_rf_2$overall[1], conf_xg_2$overall[1]),
           recall = c(conf_dt$byClass[1], conf_rf$byClass[1], conf_xgb$byClass[1],
                      conf_rf_2$byClass[1], conf_xg_2$byClass[1])) %>%
  `rownames<-`(c("Decision Tree","Random Forest","XGBoost","RF Tuned","XGB Tuned")) %>%
  arrange(-recall)
Now we have our desired, most accurate model to predict which employees are most likely to resign. The downside of a robust algorithm like Random Forest, however, is that it is difficult to interpret. Some models can be interpreted easily, such as linear or logistic regression and decision trees. This sometimes drives data scientists to choose a more interpretable model, since they need to communicate the results to managers who may not be familiar with machine learning. To solve this problem we need additional tools to interpret our black-box model; here we use a method called Local Interpretable Model-agnostic Explanations (LIME).
Intuitively, you can check the importance of each variable in the model based on its impurity. Variable importance quantifies the global contribution of each input variable to the predictions of a machine learning model.
tidy(rf_best$fit$fit$fit$variable.importance) %>%
ggplot(aes(x = x, y = reorder(names,x))) +
geom_col(aes(fill = x),show.legend = F) +
scale_fill_continuous(low = "#3b3b3b",high = "#bf0808") +
labs(title = "Random Forest Variable Importance",
x = "",y = "") +
theme_minimal()
However, variable importance measures rarely give insight into the average direction in which a variable affects the response. They simply state the magnitude of a variable's relationship with the response compared to the other variables in the model, so we can't see the influence of each factor on a single observation. That's why we need LIME, to help us understand individually what makes people resign.
NOTE: For learning purposes, I used a non-recipe workflow when building and tuning the model. However, that object is not compatible with the LIME explainer, so in the chunk below I recreate our model with the same parameter settings as our best model. For those who want a cleaner way of modeling with tidymodels, please take a look at this article.
best_setup <- rand_forest(trees = 500, mtry = 8) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_bestx <- fit_xy(object = best_setup, x = train_down %>% select(-Termd),
                   y = train_down$Termd)
set.seed(123)
explainer <- lime(x = train_down %>% select(-Termd),
                  model = rf_bestx)
## Warning: SpecialProjectsCount does not contain enough variance to use quantile
## binning. Using standard binning instead.
set.seed(123)
explanation <- explain(x = test %>% select(-Termd) %>% dplyr::slice(1:2),
                       labels = "0",
n_permutations = 500,
dist_fun = "manhattan",
explainer = explainer,
kernel_width = 3,
n_features = 10)
plot_features(explanation,ncol = 1)
We take 2 observations as the example. The plot above shows how each variable affects the prediction for these observations. The text label: 0 shows which value of the target variable is being explained. Probability shows the probability that the observation belongs to label "0" (not resign). The color of each bar represents whether the feature supports or contradicts labeling the observation as "0".
The interpretation is quite simple: in observation one (left plot), "Position = Area Sales Manager" has the biggest weight supporting the attrition label 0 (not resign). This means an employee in such a position is likely to stay in the company. In contrast, SpecialProjectsCount <= 2 contradicts the likelihood of not resigning, meaning that the fewer special projects an employee is assigned, the more they tend to leave the company.
The next element is Explanation Fit. These values indicate how well LIME explains the model. Here the Explanation Fit for the first observation is 57%, but only 36% for observation 2, which can be read as LIME explaining only a small part of our model for that observation. You may consider not fully trusting the LIME output when its Explanation Fit is low.
There are many interpretations we can take from the LIME plot above. Remember that we only took 2 predictions as an example of LIME interpretation; if you want more insight across different types of employees, you can always pass more observations to the explainer. As a data science team we now have a justification to defend our model's output, and it can also give the HR team some insight into why employees tend to leave or stay in the company.
Annotations
Yanai, I., Lercher, M. A hypothesis is a liability. Genome Biol 21, 231 (2020).