HR analytics is revolutionising the way human resources departments operate, leading to higher efficiency and better results overall. Human resources has been using analytics for years, but the collection, processing, and analysis of data has been largely manual, and given the nature of HR dynamics and HR KPIs, that approach has been constraining. It is therefore surprising that HR departments woke up to the utility of machine learning so late in the game. In this post, we will use predictive analytics to identify the employees most likely to be promoted.
Our client is a large MNC with 9 broad verticals across the organisation. One of the problems the client is facing is identifying the right people for promotion (only for manager positions and below) and preparing them in time. The process they currently follow is: 1. They first identify a set of employees based on recommendations/past performance. 2. Selected employees go through a separate training and evaluation program for each vertical, based on the skills required by that vertical. 3. At the end of the program, an employee gets a promotion based on various factors such as training performance and KPI completion (only employees with KPI completion greater than 60% are considered).
In this process, the final promotions are only announced after the evaluation, which delays the transition into the new roles. The company therefore needs help identifying eligible candidates at a particular checkpoint so it can expedite the entire promotion cycle. It has provided multiple attributes describing each employee's past and current performance along with demographics. The task is to predict whether a potential promotee at the checkpoint in the test set will be promoted after the evaluation process.
library(tidyverse)
library(zoo)
library(ggplot2)
library(plotly)
library(UBL)
library(tidymodels)
library(caret)
read.csv("data/train.csv",stringsAsFactors = T)
dat <-head(dat)
Variable definitions:
- employee_id: Unique ID for employee
- department: Department of employee
- region: Region of employment (unordered)
- education: Education level
- gender: Gender of employee
- recruitment_channel: Channel of recruitment for employee
- no_of_trainings: Number of other trainings completed in the previous year (soft skills, technical skills, etc.)
- age: Age of employee
- previous_year_rating: Employee rating for the previous year
- KPIs_met >80%: 1 if the percentage of KPIs (Key Performance Indicators) met is above 80%, else 0
- length_of_service: Length of service in years
- awards_won?: 1 if any awards were won during the previous year, else 0
- avg_training_score: Average score in current training evaluations
- is_promoted: (Target) Recommended for promotion
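Before cleaning, it is worth checking how R parsed these columns. A minimal sketch, assuming dat was loaded as above:

# inspect column types, factor levels, and a few values per column
str(dat)

# the two data-quality issues handled in the next chunk:
# blank education entries and missing previous_year_rating values
sum(dat$education == "")
sum(is.na(dat$previous_year_rating))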
dat <- dat %>%
  mutate(employee_id = as.character(employee_id),
         KPIs_met..80. = as.factor(KPIs_met..80.),
         awards_won. = as.factor(awards_won.),
         is_promoted = as.factor(is_promoted),
         previous_year_rating = replace_na(previous_year_rating, 0), # fill NA with 0
         education = na_if(education, ""),
         education = na.locf(education), # fill blank with previous value
         education = as.factor(education))
dat
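As a quick sanity check (a minimal sketch), we can confirm that no missing values remain after the imputation:

# count remaining NA values per column; all should be zero after cleaning
colSums(is.na(dat))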
Before we do the analysis, it's a good idea to get to know the data first. Employees in the company are separated into departments. Let's see which department has the most promotions.
dat %>%
  group_by(is_promoted, department) %>%
  summarise(freq = n()) %>%
  filter(is_promoted == 1) %>%
  arrange(-freq)
By raw count, employees in Sales & Marketing receive the most promotions, but that may simply be because the department also has more members than any other in the company.
prop.table(table(dat$department)) %>%
data.frame() %>%
arrange(-Freq)
It's true that the Sales & Marketing department has the most members in the company, so it makes sense that it also has the most promotions. Let's look instead at the proportion of employees promoted within their own department.
dat %>%
  group_by(department, is_promoted) %>%
  summarise(freq = n()) %>%
  ungroup() %>%
  pivot_wider(names_from = "is_promoted", values_from = "freq", names_prefix = "promoted") %>%
  mutate(prop_promoted = promoted1/(promoted0 + promoted1)) %>%
  arrange(-prop_promoted) %>%
  mutate(prop_promoted = scales::percent(prop_promoted))
Employees in the Technology department have the highest 'chance' of being promoted: 10.7% of them get a promotion, compared with only 7.2% in Sales & Marketing. So what makes the Technology department different? Let's see how employees in Technology perform compared with the other departments.
p1 <- dat %>%
  group_by(department, KPIs_met..80., education) %>%
  summarise(f = n()) %>%
  ggplot(aes(x = f, y = department)) +
  geom_col(aes(fill = department), show.legend = F) +
  facet_grid(KPIs_met..80. ~ education, scales = "free_x") +
  labs(title = "Employee KPI Completion",
       subtitle = "Based on its Department and Education Levels",
       caption = "Note: 1 = Met KPIs Completion, 0 = Doesn't meet",
       x = "Frequency", y = "Department") +
  theme_minimal() +
  theme(strip.background = element_rect(fill = "firebrick"),
        strip.text.x = element_text(colour = "white"))
p1
From the plot we can see that employees with a Bachelor's degree meet the KPI standard less often. The company only considers employees for promotion when their KPI completion is greater than 60%, so it is important to analyze the KPI variable first to narrow the analysis.
Employees who met the KPI standard actually look quite similar across education levels, so perhaps other variables have a more significant impact on an employee's promotion.
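To make the link between KPI completion and promotion concrete, here is a minimal sketch (using the cleaned dat from above) that computes the promotion rate within each KPI group:

# promotion rate for employees who did vs. did not meet the KPI threshold
dat %>%
  group_by(KPIs_met..80.) %>%
  summarise(n = n(),
            promotion_rate = mean(is_promoted == "1"))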
p2 <- dat %>%
  ggplot(aes(x = department, y = avg_training_score)) +
  geom_boxplot(aes(fill = is_promoted)) +
  facet_wrap(~department, scales = "free_x", nrow = 1) +
  scale_fill_discrete(name = "Is Promoted?", labels = c("No", "Yes")) +
  labs(title = "Employee Average Training Score",
       subtitle = "Based on its Department",
       x = "Department", y = "Average Training Score") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        strip.background = element_rect(fill = "firebrick"),
        strip.text.x = element_text(colour = "white"),
        legend.position = "bottom")
p2
In this plot, we analyze promotion against the average training score. It is safe to assume that employees with a higher average training score are more likely to be promoted. The plot also shows that the Analytics, R&D, and Technology departments have the highest average training scores.
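We can also check that claim numerically; a minimal sketch comparing the median training score of promoted and non-promoted employees:

# median average training score by promotion status
dat %>%
  group_by(is_promoted) %>%
  summarise(median_score = median(avg_training_score),
            n = n())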
We know that education may have little relation to meeting the KPI standard, while a high average training score tends to push employees towards promotion. We can actually quantify which variables have a significant effect on promotion using logistic regression.
# set target reference to 1
dat$is_promoted <- relevel(dat$is_promoted, ref = "1")

# train logistic regression model using all variables
full_mod <- glm(is_promoted ~ ., data = dat[,-1], family = "binomial")

# apply backward stepwise selection to remove insignificant variables
stats::step(full_mod, direction = "backward", trace = 0)
##
## Call: glm(formula = is_promoted ~ department + region + education +
## no_of_trainings + age + previous_year_rating + length_of_service +
## KPIs_met..80. + awards_won. + avg_training_score, family = "binomial",
## data = dat[, -1])
##
## Coefficients:
## (Intercept) departmentFinance
## 29.548169 -7.174205
## departmentHR departmentLegal
## -10.099513 -7.002741
## departmentOperations departmentProcurement
## -7.355977 -4.459055
## departmentR&D departmentSales & Marketing
## 0.512821 -10.518131
## departmentTechnology regionregion_10
## -1.739387 -0.094159
## regionregion_11 regionregion_12
## 0.364154 0.434071
## regionregion_13 regionregion_14
## 0.004259 0.040362
## regionregion_15 regionregion_16
## 0.004422 0.171497
## regionregion_17 regionregion_18
## -0.501609 -0.293339
## regionregion_19 regionregion_2
## 0.121412 -0.072974
## regionregion_20 regionregion_21
## 0.399841 0.417358
## regionregion_22 regionregion_23
## -0.392316 -0.380982
## regionregion_24 regionregion_25
## 0.412778 -0.491390
## regionregion_26 regionregion_27
## 0.167603 -0.036712
## regionregion_28 regionregion_29
## -0.318568 0.571239
## regionregion_3 regionregion_30
## -0.235711 -0.098333
## regionregion_31 regionregion_32
## 0.299023 0.569194
## regionregion_33 regionregion_34
## 0.378497 0.971796
## regionregion_4 regionregion_5
## -0.621710 0.412832
## regionregion_6 regionregion_7
## 0.517562 -0.329530
## regionregion_8 regionregion_9
## 0.240501 1.225535
## educationBelow Secondary educationMaster's & above
## 0.016001 -0.145358
## no_of_trainings age
## 0.137131 0.029791
## previous_year_rating length_of_service
## -0.152352 -0.024143
## KPIs_met..80.1 awards_won.1
## -1.926362 -1.458020
## avg_training_score
## -0.309334
##
## Degrees of Freedom: 54807 Total (i.e. Null); 54757 Residual
## Null Deviance: 31920
## Residual Deviance: 21620 AIC: 21720
The best logistic regression model from the stepwise selection:

glm_best <- glm(formula = is_promoted ~ department + region + education +
    no_of_trainings + age + previous_year_rating + length_of_service +
    KPIs_met..80. + awards_won. + avg_training_score, family = "binomial",
    data = dat[, -1])
summary(glm_best)
##
## Call:
## glm(formula = is_promoted ~ department + region + education +
## no_of_trainings + age + previous_year_rating + length_of_service +
## KPIs_met..80. + awards_won. + avg_training_score, family = "binomial",
## data = dat[, -1])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4420 0.1250 0.2037 0.3575 1.9778
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 29.548169 0.494901 59.705 < 0.0000000000000002
## departmentFinance -7.174205 0.159999 -44.839 < 0.0000000000000002
## departmentHR -10.099513 0.210879 -47.893 < 0.0000000000000002
## departmentLegal -7.002741 0.212146 -33.009 < 0.0000000000000002
## departmentOperations -7.355977 0.140836 -52.231 < 0.0000000000000002
## departmentProcurement -4.459055 0.104416 -42.705 < 0.0000000000000002
## departmentR&D 0.512821 0.147105 3.486 0.000490
## departmentSales & Marketing -10.518131 0.187945 -55.964 < 0.0000000000000002
## departmentTechnology -1.739387 0.074965 -23.203 < 0.0000000000000002
## regionregion_10 -0.094159 0.238161 -0.395 0.692578
## regionregion_11 0.364154 0.216030 1.686 0.091861
## regionregion_12 0.434071 0.287772 1.508 0.131456
## regionregion_13 0.004259 0.186338 0.023 0.981763
## regionregion_14 0.040362 0.224058 0.180 0.857043
## regionregion_15 0.004422 0.186461 0.024 0.981079
## regionregion_16 0.171497 0.206302 0.831 0.405810
## regionregion_17 -0.501609 0.210437 -2.384 0.017142
## regionregion_18 -0.293339 1.056517 -0.278 0.781283
## regionregion_19 0.121412 0.232196 0.523 0.601055
## regionregion_2 -0.072974 0.172329 -0.423 0.671960
## regionregion_20 0.399841 0.240776 1.661 0.096787
## regionregion_21 0.417358 0.312030 1.338 0.181041
## regionregion_22 -0.392316 0.173100 -2.266 0.023426
## regionregion_23 -0.380982 0.198887 -1.916 0.055419
## regionregion_24 0.412778 0.307428 1.343 0.179375
## regionregion_25 -0.491390 0.207831 -2.364 0.018060
## regionregion_26 0.167603 0.193807 0.865 0.387151
## regionregion_27 -0.036712 0.198442 -0.185 0.853228
## regionregion_28 -0.318568 0.195881 -1.626 0.103879
## regionregion_29 0.571239 0.240103 2.379 0.017353
## regionregion_3 -0.235711 0.272519 -0.865 0.387075
## regionregion_30 -0.098333 0.237884 -0.413 0.679338
## regionregion_31 0.299023 0.202285 1.478 0.139347
## regionregion_32 0.569194 0.248595 2.290 0.022042
## regionregion_33 0.378497 0.379438 0.998 0.318512
## regionregion_34 0.971796 0.456933 2.127 0.033438
## regionregion_4 -0.621710 0.185899 -3.344 0.000825
## regionregion_5 0.412832 0.260493 1.585 0.113009
## regionregion_6 0.517562 0.267897 1.932 0.053366
## regionregion_7 -0.329530 0.175734 -1.875 0.060771
## regionregion_8 0.240501 0.240295 1.001 0.316896
## regionregion_9 1.225535 0.428696 2.859 0.004253
## educationBelow Secondary 0.016001 0.150294 0.106 0.915213
## educationMaster's & above -0.145358 0.043940 -3.308 0.000939
## no_of_trainings 0.137131 0.035005 3.918 0.000089469259937851
## age 0.029791 0.003660 8.140 0.000000000000000394
## previous_year_rating -0.152352 0.013628 -11.179 < 0.0000000000000002
## length_of_service -0.024143 0.006050 -3.991 0.000065899685701398
## KPIs_met..80.1 -1.926362 0.043827 -43.954 < 0.0000000000000002
## awards_won.1 -1.458020 0.078846 -18.492 < 0.0000000000000002
## avg_training_score -0.309334 0.005134 -60.254 < 0.0000000000000002
##
## (Intercept) ***
## departmentFinance ***
## departmentHR ***
## departmentLegal ***
## departmentOperations ***
## departmentProcurement ***
## departmentR&D ***
## departmentSales & Marketing ***
## departmentTechnology ***
## regionregion_10
## regionregion_11 .
## regionregion_12
## regionregion_13
## regionregion_14
## regionregion_15
## regionregion_16
## regionregion_17 *
## regionregion_18
## regionregion_19
## regionregion_2
## regionregion_20 .
## regionregion_21
## regionregion_22 *
## regionregion_23 .
## regionregion_24
## regionregion_25 *
## regionregion_26
## regionregion_27
## regionregion_28
## regionregion_29 *
## regionregion_3
## regionregion_30
## regionregion_31
## regionregion_32 *
## regionregion_33
## regionregion_34 *
## regionregion_4 ***
## regionregion_5
## regionregion_6 .
## regionregion_7 .
## regionregion_8
## regionregion_9 **
## educationBelow Secondary
## educationMaster's & above ***
## no_of_trainings ***
## age ***
## previous_year_rating ***
## length_of_service ***
## KPIs_met..80.1 ***
## awards_won.1 ***
## avg_training_score ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 31922 on 54807 degrees of freedom
## Residual deviance: 21623 on 54757 degrees of freedom
## AIC: 21725
##
## Number of Fisher Scoring iterations: 7
From the summary above we know that, statistically, not all variables have a significant effect on the target: gender and recruitment_channel have little to no effect on promotion and were dropped by the stepwise procedure. Every department has a significant effect; R&D stands out as the only one whose coefficient differs in sign, though it still has a low p-value. Among the 34 regions, region_4 has the lowest p-value, meaning employees from that region are the most likely to be promoted. Almost all of the remaining variables have a similarly significant effect on the target. In short, to become a candidate for promotion an employee needs to improve their KPI completion and training score, show loyalty (as seen from length of service), complete more trainings, and ideally win an award.
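To read the logistic coefficients on a more intuitive scale, they can be exponentiated into odds ratios. A minimal sketch; note that because we releveled is_promoted with ref = "1", glm() models the probability of the second level ("0", not promoted), so an odds ratio below 1 actually points towards promotion:

# convert log-odds coefficients into odds ratios and show the largest few
exp(coef(glm_best)) %>%
  sort(decreasing = TRUE) %>%
  head(10)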
GGally::ggcorr(dat %>% select_if(is.numeric), label = T)
age and length_of_service are highly correlated. That makes sense, but we need to address it to avoid multicollinearity, so we will do some feature engineering before modeling.
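A minimal numeric check of the pair flagged by the correlation plot:

# correlation between age and tenure before any transformation
cor(dat$age, dat$length_of_service)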
table(dat$no_of_trainings)
##
## 1 2 3 4 5 6 7 8 9 10
## 44378 7987 1776 468 128 44 12 5 5 5
We also need to collapse no_of_trainings into 2 tiers, since the variable is heavily imbalanced, and that is a problem if we want to scale the numeric data by its median.
# set employee id to rownames
rownames(dat) <- dat$employee_id
dat <- dat[,-1]

# feature engineering:
# change age to 3 tiers and no_of_trainings to 2 tiers,
# then remove duplicate rows
dat <- dat %>%
  mutate(age = as.factor(case_when(age < 30 ~ "junior",
                                   age >= 30 & age <= 40 ~ "fresh",
                                   age > 40 ~ "senior")),
         no_of_trainings = as.factor(ifelse(no_of_trainings < 2, "rarely", "often"))) %>%
  distinct()
# robust scaling using median instead of mean
dat_num <- dat %>%
  select_if(is.numeric) %>%
  quantable::robustscale()

dat <- cbind(dat_num$data, dat %>% select_if(is.factor))
dat
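For intuition, robust scaling centers each numeric column on its median and divides by a robust spread measure instead of the mean and standard deviation. A rough illustrative sketch of the idea (not the exact quantable implementation):

# illustrative only: robust scaling of a numeric vector by hand,
# centering on the median and dividing by the IQR
robust_scale_manual <- function(x) {
  (x - median(x, na.rm = TRUE)) / IQR(x, na.rm = TRUE)
}
robust_scale_manual(c(40, 50, 60, 99)) # stays bounded even with an outlier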
set.seed(123)
splitter <- initial_split(dat, prop = 0.8, strata = "is_promoted")
train <- training(splitter)
test <- testing(splitter)
prop.table(table(train$is_promoted))
##
## 1 0
## 0.08701777 0.91298223
The target levels are imbalanced, so we need to rebalance them using downsampling. We don't want an exact 50:50 split, though; we want some of the imbalance to remain in the training data, since that is what actually happens in real life. With under_ratio = 1/0.55 below, the majority class is kept at roughly 1.8 times the minority class, which works out to about a 35:65 split.
set.seed(123)
recipe <- recipe(is_promoted ~ ., data = train) %>%
  step_downsample(is_promoted, under_ratio = 1/0.55, seed = 123) %>%
  prep()

train_balance <- juice(recipe)
test <- bake(recipe, test)
prop.table(table(train_balance$is_promoted))
##
## 1 0
## 0.3548449 0.6451551
Set up 3-fold cross-validation:

set.seed(123)
folds <- vfold_cv(train_balance, 3, strata = "is_promoted")
Grid tuning: we will tune the mtry and trees parameters using a tuning grid. This process takes a lot of time because it trains every possible parameter combination and extracts the best result. We will pick the model with the highest F1 score.
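As a reminder, the F1 score is the harmonic mean of precision and recall; a minimal helper to compute it from those two quantities:

# F1 = 2 * precision * recall / (precision + recall)
f1_score <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}
f1_score(0.48, 0.48) # illustrative values, close to what we see later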
# trees and mtry grid combination
rf.grid <- expand.grid(trees = seq(450, 650, 50), mtry = 3:7)

# model setup: random forest using the ranger engine, where trees and mtry
# will be filled in from the grid
rf.setup <- rand_forest(trees = tune(), mtry = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# formula workflow
rf.wf <- workflow() %>%
  add_model(rf.setup) %>%
  add_recipe(recipe(is_promoted ~ ., data = train_balance))

# fit the data to the model workflow
# rf.tune <- tune_grid(rf.wf, resamples = folds, grid = rf.grid,
#                      metrics = metric_set(accuracy, sens, spec, yardstick::precision, f_meas))
rf.tune <- readRDS("rf_tune.rds")
show_best(rf.tune, metric = "f_meas")
The random forest with mtry = 7 and 500 trees is the best model based on F1 score, so we will use it as our main random forest model.
rf_best <- rf.wf %>%
  finalize_workflow(select_best(rf.tune, "f_meas")) %>%
  fit(train_balance)

rf_pred <- predict(rf_best, test, type = "prob")
Since the data are imbalanced, it isn't wise to use accuracy as our metric. We predict the test data and pull the probabilities instead of the predicted class. From the probabilities, we will pick the cutoff that gives the most balanced precision and recall.
We will use the cmplot package to visualize the cutoff. You can install cmplot from GitHub: https://github.com/ahmadhusain/cmplot
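If cmplot is not available, the same idea can be sketched by hand: scan candidate cutoffs and compute precision and recall at each one (illustrative only, assuming rf_pred and test as defined above):

# scan cutoffs and compute precision/recall for the positive class "1"
cutoffs <- seq(0.1, 0.9, by = 0.05)
scan <- sapply(cutoffs, function(k) {
  pred <- ifelse(rf_pred$.pred_1 > k, "1", "0")
  tp <- sum(pred == "1" & test$is_promoted == "1")
  fp <- sum(pred == "1" & test$is_promoted == "0")
  fn <- sum(pred == "0" & test$is_promoted == "1")
  c(precision = tp / (tp + fp), recall = tp / (tp + fn))
})
data.frame(cutoff = cutoffs, t(scan))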
library(cmplot)
confmat_plot(prob = rf_pred$.pred_1,ref = test$is_promoted,postarget = "1",negtarget = "0")
From the plot above, a cutoff of 0.6483 looks like the best choice for the most balanced precision and recall. Let's see how it looks in the confusion matrix.
rf_pred$class <- as.factor(ifelse(rf_pred$.pred_1 > 0.6483, "1", "0"))
cm_rf <- confusionMatrix(rf_pred$class, test$is_promoted, positive = "1")
cm_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 456 500
## 0 490 9200
##
## Accuracy : 0.907
## 95% CI : (0.9013, 0.9125)
## No Information Rate : 0.9111
## P-Value [Acc > NIR] : 0.9345
##
## Kappa : 0.4284
##
## Mcnemar's Test P-Value : 0.7748
##
## Sensitivity : 0.48203
## Specificity : 0.94845
## Pos Pred Value : 0.47699
## Neg Pred Value : 0.94943
## Prevalence : 0.08886
## Detection Rate : 0.04283
## Detection Prevalence : 0.08980
## Balanced Accuracy : 0.71524
##
## 'Positive' Class : 1
##
We have high accuracy (90.7%) but low precision and sensitivity. The matrix doesn't look great because there are plenty of false positives and false negatives, but that is still tolerable considering how imbalanced the target classes are.
Let's try another algorithm. This time we will use XGBoost.
rf.grid <- expand.grid(trees = seq(450, 650, 50), mtry = 3:7)

xg.setup <- boost_tree(trees = tune(), mtry = tune(), learn_rate = 0.1, tree_depth = 5) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xg.wf <- workflow() %>%
  add_formula(is_promoted ~ .) %>%
  add_model(xg.setup)

# xg.tune <- tune_grid(xg.wf, resamples = folds, grid = rf.grid,
#                      metrics = metric_set(accuracy, sens, spec, yardstick::precision, f_meas))
xg.tune <- readRDS("xg_tune.rds")
show_best(xg.tune, metric = "f_meas")
The boosted tree with mtry = 7 and 650 trees is the best model based on F1 score, so we will use it as our main XGBoost model.
xg_best <- xg.wf %>%
  finalize_workflow(select_best(xg.tune, "f_meas")) %>%
  fit(train_balance)

xg_pred <- predict(xg_best, test, type = "prob")
confmat_plot(prob = xg_pred$.pred_1,ref = test$is_promoted,postarget = "1",negtarget = "0")
From the plot above, a cutoff of 0.6483 again looks like the best choice for the most balanced precision and recall. Let's see how it looks in the confusion matrix.
xg_pred$class <- as.factor(ifelse(xg_pred$.pred_1 > 0.6483, "1", "0"))
cm_xg <- confusionMatrix(xg_pred$class, test$is_promoted, positive = "1")
cm_xg
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 445 447
## 0 501 9253
##
## Accuracy : 0.911
## 95% CI : (0.9054, 0.9163)
## No Information Rate : 0.9111
## P-Value [Acc > NIR] : 0.53577
##
## Kappa : 0.4355
##
## Mcnemar's Test P-Value : 0.08519
##
## Sensitivity : 0.47040
## Specificity : 0.95392
## Pos Pred Value : 0.49888
## Neg Pred Value : 0.94864
## Prevalence : 0.08886
## Detection Rate : 0.04180
## Detection Prevalence : 0.08379
## Balanced Accuracy : 0.71216
##
## 'Positive' Class : 1
##
We get slightly higher accuracy than the random forest model: 91.1%. Precision and sensitivity are still low; precision is slightly better than the random forest's, though sensitivity is slightly worse.
eval <- data.frame(
  Accuracy = c(cm_rf$overall[1], cm_xg$overall[1]),
  Sensitivity = c(cm_rf$byClass[1], cm_xg$byClass[1]),
  Specificity = c(cm_rf$byClass[2], cm_xg$byClass[2]),
  Precision = c(cm_rf$byClass[5], cm_xg$byClass[5]),
  Recall = c(cm_rf$byClass[6], cm_xg$byClass[6]),
  F1 = c(cm_rf$byClass[7], cm_xg$byClass[7])
) %>% `rownames<-`(c("Random Forest", "XGBoost"))
eval