HR analytics is revolutionising the way human resources departments operate, leading to higher efficiency and better results overall. Human resources has used analytics for years, but the collection, processing, and analysis of data has been largely manual, and given the nature of HR dynamics and HR KPIs, that approach has constrained the function. It is therefore surprising that HR departments woke up to the utility of machine learning so late in the game. In this post, we will use predictive analytics to identify the employees most likely to be promoted.

Background

Objective

Your client is a large MNC with 9 broad verticals across the organisation. One of the problems the client is facing is identifying the right people for promotion (only for manager positions and below) and preparing them in time. The process they currently follow is:

1. They first identify a set of employees based on recommendations / past performance.
2. Selected employees go through a separate training and evaluation program for each vertical; these programs are based on the skills required by each vertical.
3. At the end of the program, based on various factors such as training performance and KPI completion (only employees with KPI completion greater than 60% are considered), the employee gets the promotion.

In the process above, the final promotions are announced only after the evaluation, which delays the transition into the new roles. The company therefore needs help identifying eligible candidates at a particular checkpoint so that it can expedite the entire promotion cycle. They have provided multiple attributes around each employee’s past and current performance along with demographics. The task is to predict whether a potential promotee at the checkpoint in the test set will be promoted or not after the evaluation process.

Libraries

library(tidyverse)
library(zoo)
library(ggplot2)
library(plotly)
library(UBL)
library(tidymodels)
library(caret)

Let’s Begin

Data Import

dat <- read.csv("data/train.csv",stringsAsFactors = T)
head(dat)

Variable definitions:

- employee_id: unique ID for the employee
- department: department of the employee
- region: region of employment (unordered)
- education: education level
- gender: gender of the employee
- recruitment_channel: channel of recruitment for the employee
- no_of_trainings: number of other trainings completed in the previous year (soft skills, technical skills, etc.)
- age: age of the employee
- previous_year_rating: employee rating for the previous year
- length_of_service: length of service in years
- KPIs_met >80%: 1 if the percentage of KPIs (Key Performance Indicators) met is >80%, else 0
- awards_won?: 1 if any awards were won during the previous year, else 0
- avg_training_score: average score in current training evaluations
- is_promoted: (target) recommended for promotion
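
Before converting the column types, it is worth confirming how read.csv parsed each column and where the missing values sit. A quick sketch (the column names follow the schema described above); this motivates the NA handling in the next chunk.

glimpse(dat)
colSums(is.na(dat))               # previous_year_rating contains NAs
colSums(dat == "", na.rm = TRUE)  # education contains blank strings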

dat <- dat %>% 
  mutate(employee_id = as.character(employee_id),
         KPIs_met..80. = as.factor(KPIs_met..80.),
         awards_won. = as.factor(awards_won.),
         is_promoted = as.factor(is_promoted),
         previous_year_rating = replace_na(previous_year_rating,0), # fill na with 0
         education = na_if(education,""),
         education = na.locf(education), # fill blank with previous value
         education = as.factor(education))
dat

Exploratory Data Analysis

Before we do any analysis, it is a good idea to get to know the data first. Employees in the company are divided across departments, so let’s see which department has the most promotions.

dat %>% 
  group_by(is_promoted,department) %>% 
  summarise(freq = n()) %>% 
  filter(is_promoted == 1) %>% 
  arrange(-freq)

By raw count, Sales & Marketing has the most promoted employees, but that may simply be because it is also the largest department in the company.

prop.table(table(dat$department)) %>% 
  data.frame() %>% 
  arrange(-Freq)

Sales & Marketing is indeed the largest department in the company, so it makes sense that it also has the most promotions. Let’s look instead at the proportion of employees promoted within their own department.

dat %>% 
  group_by(department,is_promoted) %>% 
  summarise(freq = n()) %>% 
  ungroup() %>% 
  pivot_wider(names_from = "is_promoted",values_from = "freq",names_prefix = "promoted") %>% 
  mutate(prop_promoted = promoted1/(promoted0+promoted1)) %>% 
  arrange(-prop_promoted) %>% 
  mutate(prop_promoted = scales::percent(prop_promoted))

Employees in the Technology department have the highest ‘chance’ of being promoted: 10.7% of them get a promotion, compared with only 7.2% in Sales & Marketing. So what makes the Technology department different? Let’s compare how employees in Technology perform against the other departments.

p1 <- dat %>% 
  group_by(department,KPIs_met..80.,education) %>% 
  summarise(f = n()) %>% 
  ggplot(aes(x = f, y = department)) +
  geom_col(aes(fill = department),show.legend = F) +
  facet_grid(KPIs_met..80.~education,scales = "free_x") +
  labs(title = "Employee KPI Completion",
       subtitle = "Based on its Department and Education Levels",
       caption = "Note: 1 = Met KPIs Completion, 0 = Doesn't meet",
       x = "Freqency", y = "Department") +
  theme_minimal() +
  theme(strip.background = element_rect(fill = "firebrick"),
         strip.text.x = element_text(colour = "white"))
p1

From the plot we can see that employees with a Bachelor’s degree are relatively less likely to meet the KPI standard. The company only gives promotions to employees with KPI completion greater than 60%, so it is important to analyze the KPI variable first to narrow down the analysis.

Across education levels, however, the employees who meet the KPI standard look quite similar. Perhaps other variables have a more significant impact on an employee’s promotion.
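
To back up that impression with numbers, we can compute the share of employees who meet the KPI standard within each education level (a quick sketch using the same columns as above):

dat %>% 
  group_by(education) %>% 
  summarise(prop_kpi_met = mean(KPIs_met..80. == "1")) %>% 
  arrange(-prop_kpi_met)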

p2 <- dat %>% 
  ggplot(aes(x = department,y = avg_training_score)) +
  geom_boxplot(aes(fill = is_promoted)) +
  facet_wrap(~department,scales = "free_x",nrow=1)+
  scale_fill_discrete(name = "Is Promoted?",labels=c("No","Yes")) +
  labs(title = "Employee Average Training Score",
       subtitle = "Based on its Department",
       x = "Department", y = "Average Training Score")+
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        strip.background = element_rect(fill = "firebrick"),
        strip.text.x = element_text(colour = "white"),
        legend.position = "bottom")

p2

In this plot we look at promotion against the average training score. It is safe to assume that employees with a higher average training score are more likely to be promoted. The plot also shows that the Analytics, R&D, and Technology departments have the highest average training scores.
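
We can quantify that impression with a simple summary of the training score by promotion status (a rough check rather than a formal test):

dat %>% 
  group_by(is_promoted) %>% 
  summarise(mean_score = mean(avg_training_score),
            median_score = median(avg_training_score))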

So far we suspect that education has little bearing on meeting the KPI standard, while a high average training score makes promotion more likely. We can test which variables have a statistically significant effect on promotion using logistic regression.

Logistic Regression

dat$is_promoted <- relevel(dat$is_promoted,ref = "1") # set target reference to 1
# train logistic regression model using all variables
full_mod <- glm(is_promoted ~., data = dat[,-1],family = "binomial") 
# apply backward stepwise elimination to drop non-significant variables
stats::step(full_mod,direction = "backward",trace = 0)
## 
## Call:  glm(formula = is_promoted ~ department + region + education + 
##     no_of_trainings + age + previous_year_rating + length_of_service + 
##     KPIs_met..80. + awards_won. + avg_training_score, family = "binomial", 
##     data = dat[, -1])
## 
## Coefficients:
##                 (Intercept)            departmentFinance  
##                   29.548169                    -7.174205  
##                departmentHR              departmentLegal  
##                  -10.099513                    -7.002741  
##        departmentOperations        departmentProcurement  
##                   -7.355977                    -4.459055  
##               departmentR&D  departmentSales & Marketing  
##                    0.512821                   -10.518131  
##        departmentTechnology              regionregion_10  
##                   -1.739387                    -0.094159  
##             regionregion_11              regionregion_12  
##                    0.364154                     0.434071  
##             regionregion_13              regionregion_14  
##                    0.004259                     0.040362  
##             regionregion_15              regionregion_16  
##                    0.004422                     0.171497  
##             regionregion_17              regionregion_18  
##                   -0.501609                    -0.293339  
##             regionregion_19               regionregion_2  
##                    0.121412                    -0.072974  
##             regionregion_20              regionregion_21  
##                    0.399841                     0.417358  
##             regionregion_22              regionregion_23  
##                   -0.392316                    -0.380982  
##             regionregion_24              regionregion_25  
##                    0.412778                    -0.491390  
##             regionregion_26              regionregion_27  
##                    0.167603                    -0.036712  
##             regionregion_28              regionregion_29  
##                   -0.318568                     0.571239  
##              regionregion_3              regionregion_30  
##                   -0.235711                    -0.098333  
##             regionregion_31              regionregion_32  
##                    0.299023                     0.569194  
##             regionregion_33              regionregion_34  
##                    0.378497                     0.971796  
##              regionregion_4               regionregion_5  
##                   -0.621710                     0.412832  
##              regionregion_6               regionregion_7  
##                    0.517562                    -0.329530  
##              regionregion_8               regionregion_9  
##                    0.240501                     1.225535  
##    educationBelow Secondary    educationMaster's & above  
##                    0.016001                    -0.145358  
##             no_of_trainings                          age  
##                    0.137131                     0.029791  
##        previous_year_rating            length_of_service  
##                   -0.152352                    -0.024143  
##              KPIs_met..80.1                 awards_won.1  
##                   -1.926362                    -1.458020  
##          avg_training_score  
##                   -0.309334  
## 
## Degrees of Freedom: 54807 Total (i.e. Null);  54757 Residual
## Null Deviance:       31920 
## Residual Deviance: 21620     AIC: 21720

The best logistic regression model from the stepwise selection:

glm_best <- glm(formula = is_promoted ~ department + region + education + 
    no_of_trainings + age + previous_year_rating + length_of_service + 
    KPIs_met..80. + awards_won. + avg_training_score, family = "binomial", 
    data = dat[, -1])
summary(glm_best)
## 
## Call:
## glm(formula = is_promoted ~ department + region + education + 
##     no_of_trainings + age + previous_year_rating + length_of_service + 
##     KPIs_met..80. + awards_won. + avg_training_score, family = "binomial", 
##     data = dat[, -1])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4420   0.1250   0.2037   0.3575   1.9778  
## 
## Coefficients:
##                               Estimate Std. Error z value             Pr(>|z|)
## (Intercept)                  29.548169   0.494901  59.705 < 0.0000000000000002
## departmentFinance            -7.174205   0.159999 -44.839 < 0.0000000000000002
## departmentHR                -10.099513   0.210879 -47.893 < 0.0000000000000002
## departmentLegal              -7.002741   0.212146 -33.009 < 0.0000000000000002
## departmentOperations         -7.355977   0.140836 -52.231 < 0.0000000000000002
## departmentProcurement        -4.459055   0.104416 -42.705 < 0.0000000000000002
## departmentR&D                 0.512821   0.147105   3.486             0.000490
## departmentSales & Marketing -10.518131   0.187945 -55.964 < 0.0000000000000002
## departmentTechnology         -1.739387   0.074965 -23.203 < 0.0000000000000002
## regionregion_10              -0.094159   0.238161  -0.395             0.692578
## regionregion_11               0.364154   0.216030   1.686             0.091861
## regionregion_12               0.434071   0.287772   1.508             0.131456
## regionregion_13               0.004259   0.186338   0.023             0.981763
## regionregion_14               0.040362   0.224058   0.180             0.857043
## regionregion_15               0.004422   0.186461   0.024             0.981079
## regionregion_16               0.171497   0.206302   0.831             0.405810
## regionregion_17              -0.501609   0.210437  -2.384             0.017142
## regionregion_18              -0.293339   1.056517  -0.278             0.781283
## regionregion_19               0.121412   0.232196   0.523             0.601055
## regionregion_2               -0.072974   0.172329  -0.423             0.671960
## regionregion_20               0.399841   0.240776   1.661             0.096787
## regionregion_21               0.417358   0.312030   1.338             0.181041
## regionregion_22              -0.392316   0.173100  -2.266             0.023426
## regionregion_23              -0.380982   0.198887  -1.916             0.055419
## regionregion_24               0.412778   0.307428   1.343             0.179375
## regionregion_25              -0.491390   0.207831  -2.364             0.018060
## regionregion_26               0.167603   0.193807   0.865             0.387151
## regionregion_27              -0.036712   0.198442  -0.185             0.853228
## regionregion_28              -0.318568   0.195881  -1.626             0.103879
## regionregion_29               0.571239   0.240103   2.379             0.017353
## regionregion_3               -0.235711   0.272519  -0.865             0.387075
## regionregion_30              -0.098333   0.237884  -0.413             0.679338
## regionregion_31               0.299023   0.202285   1.478             0.139347
## regionregion_32               0.569194   0.248595   2.290             0.022042
## regionregion_33               0.378497   0.379438   0.998             0.318512
## regionregion_34               0.971796   0.456933   2.127             0.033438
## regionregion_4               -0.621710   0.185899  -3.344             0.000825
## regionregion_5                0.412832   0.260493   1.585             0.113009
## regionregion_6                0.517562   0.267897   1.932             0.053366
## regionregion_7               -0.329530   0.175734  -1.875             0.060771
## regionregion_8                0.240501   0.240295   1.001             0.316896
## regionregion_9                1.225535   0.428696   2.859             0.004253
## educationBelow Secondary      0.016001   0.150294   0.106             0.915213
## educationMaster's & above    -0.145358   0.043940  -3.308             0.000939
## no_of_trainings               0.137131   0.035005   3.918 0.000089469259937851
## age                           0.029791   0.003660   8.140 0.000000000000000394
## previous_year_rating         -0.152352   0.013628 -11.179 < 0.0000000000000002
## length_of_service            -0.024143   0.006050  -3.991 0.000065899685701398
## KPIs_met..80.1               -1.926362   0.043827 -43.954 < 0.0000000000000002
## awards_won.1                 -1.458020   0.078846 -18.492 < 0.0000000000000002
## avg_training_score           -0.309334   0.005134 -60.254 < 0.0000000000000002
##                                
## (Intercept)                 ***
## departmentFinance           ***
## departmentHR                ***
## departmentLegal             ***
## departmentOperations        ***
## departmentProcurement       ***
## departmentR&D               ***
## departmentSales & Marketing ***
## departmentTechnology        ***
## regionregion_10                
## regionregion_11             .  
## regionregion_12                
## regionregion_13                
## regionregion_14                
## regionregion_15                
## regionregion_16                
## regionregion_17             *  
## regionregion_18                
## regionregion_19                
## regionregion_2                 
## regionregion_20             .  
## regionregion_21                
## regionregion_22             *  
## regionregion_23             .  
## regionregion_24                
## regionregion_25             *  
## regionregion_26                
## regionregion_27                
## regionregion_28                
## regionregion_29             *  
## regionregion_3                 
## regionregion_30                
## regionregion_31                
## regionregion_32             *  
## regionregion_33                
## regionregion_34             *  
## regionregion_4              ***
## regionregion_5                 
## regionregion_6              .  
## regionregion_7              .  
## regionregion_8                 
## regionregion_9              ** 
## educationBelow Secondary       
## educationMaster's & above   ***
## no_of_trainings             ***
## age                         ***
## previous_year_rating        ***
## length_of_service           ***
## KPIs_met..80.1              ***
## awards_won.1                ***
## avg_training_score          ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 31922  on 54807  degrees of freedom
## Residual deviance: 21623  on 54757  degrees of freedom
## AIC: 21725
## 
## Number of Fisher Scoring iterations: 7

From the summary above we know that, statistically, not all variables have a significant effect on the target. Gender and recruitment channel were dropped by the stepwise selection, so they have little to no effect on promotion. Note that because is_promoted was releveled so that “1” is the first (reference) level, glm() models the probability of the second level (“0”, not promoted); a negative coefficient therefore corresponds to higher promotion odds. Every department coefficient is significant, with R&D the only one carrying a positive sign. Of the 34 regions, region 4 has the lowest p-value among the region dummies and a negative coefficient, meaning employees from that region are the most likely to be promoted. Almost all of the remaining variables are highly significant. In short, to become a promotion candidate an employee should keep KPI completion above the threshold, score well in training, win awards, and build loyalty, as reflected in length of service.
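
To make the coefficients easier to read we can exponentiate them into odds ratios. Remember the releveling above: these are odds of the modelled class “0” (not promoted), so values below 1 favour promotion. A quick sketch:

# odds ratios for the stepwise-selected model
# (odds of "0" = not promoted, because "1" was set as the reference level)
exp(coef(glm_best)) %>% 
  sort() %>% 
  head(10)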

GGally::ggcorr(dat %>% select(where(is.numeric)), label = T)

age and length_of_service are highly correlated. That makes sense, but we need to address it to avoid multicollinearity, so we will do some feature engineering before modeling.
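
A quick numeric check of that pair, just to confirm what the correlation plot shows:

cor(dat$age, dat$length_of_service)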

table(dat$no_of_trainings)
## 
##     1     2     3     4     5     6     7     8     9    10 
## 44378  7987  1776   468   128    44    12     5     5     5

We also need to collapse no_of_trainings into 2 tiers, since the variable is heavily imbalanced and that is not ideal if we want to scale the numeric data based on its median.

Modeling

# set employee id to rownames
rownames(dat) <- dat$employee_id
dat <- dat[,-1]
# feature engineering
# change age to 3 tier and no_of_trainings to 2 tier
# remove duplicate row
dat <- dat %>% 
  mutate(age = as.factor(case_when(age < 30 ~ "junior",
                         age >= 30 & age <=40 ~ "fresh",
                         age >40 ~ "senior")),
         no_of_trainings = as.factor(ifelse(no_of_trainings < 2,"rarely","often"))) %>% 
  distinct()

# robust scaling using median instead of mean
dat_num <- dat %>% 
  select_if(is.numeric) %>% 
  quantable::robustscale()

dat <- cbind(dat_num$data,dat %>% select_if(is.factor))
dat
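
quantable::robustscale centers each numeric column and scales it with robust statistics. As I understand its defaults, that is roughly subtracting the column median and dividing by the MAD (check ?quantable::robustscale); a minimal manual sketch of the idea, using a helper name of my own:

# hypothetical helper illustrating median/MAD scaling (not quantable's actual implementation)
robust_scale <- function(x) (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)
robust_scale(c(1, 2, 3, 10, 100))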

Cross validation

set.seed(123)
splitter <- initial_split(dat,prop = 0.8,strata = "is_promoted")
train <- training(splitter)
test <- testing(splitter)

Balancing

prop.table(table(train$is_promoted))
## 
##          1          0 
## 0.08701777 0.91298223

The target classes are imbalanced, so we need to rebalance them with downsampling. We also don’t want the data to be an exact 50:50 split; we want some of the natural imbalance to remain in the training data, since that is what actually happens in real life. In step_downsample(), under_ratio is the maximum ratio of majority-class to minority-class rows, so under_ratio = 1/0.55 ≈ 1.82 keeps roughly a 65:35 split, which matches the proportions below.

set.seed(123)
recipe <- recipe(is_promoted~., data = train) %>% 
  step_downsample(is_promoted,under_ratio = 1/0.55,seed = 123) %>% 
  prep()

train_balance <- juice(recipe)
test <- bake(recipe,test)
prop.table(table(train_balance$is_promoted))
## 
##         1         0 
## 0.3548449 0.6451551

Random Forest

Set k-fold cross validation

set.seed(123)
folds <- vfold_cv(train_balance,3,strata = "is_promoted")

Grid tuning: we will tune the mtry and trees parameters over a grid. This process takes a long time because it trains every parameter combination and extracts the best result. We will use the highest F1 score as the criterion for the best model.

# trees and mtry grid combination 
rf.grid <- expand.grid(trees = seq(450,650,50), mtry = 3:7)

# model setup. random forest using ranger engine where the trees and mtry 
# will be changed by its grid
rf.setup <- rand_forest(trees = tune(), mtry = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# formula workflow
rf.wf <- workflow() %>% 
  add_model(rf.setup) %>% 
  add_recipe(recipe(is_promoted ~.,data = train_balance))

# fit the data to model workflow
# rf.tune <- tune_grid(rf.wf,resamples = folds,grid = rf.grid,
#                      metrics = metric_set(accuracy, sens, spec,yardstick::precision,f_meas))
rf.tune <- readRDS("rf_tune.rds")
show_best(rf.tune,metric = "f_meas")

A random forest with mtry = 7 and 500 trees is the best model based on F1 score, so we will use it as our main random forest model.

rf_best <- rf.wf %>% 
  finalize_workflow(select_best(rf.tune,"f_meas")) %>% 
  fit(train_balance)

rf_pred <- predict(rf_best,test,type = "prob")

Since the data are imbalanced, it isn’t wise to use accuracy as our metric. We predict on the test data and pull the class probabilities instead of the predicted classes. From the probabilities, we will set the optimal cutoff that gives the most balanced precision and recall. We will use the cmplot package to visualize the cutoff; you can install it from GitHub: https://github.com/ahmadhusain/cmplot
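
If you prefer not to install cmplot, a rough manual sweep over candidate cutoffs shows the same precision/recall trade-off. This is a sketch built on caret’s confusionMatrix; the cutoff grid is an arbitrary choice of mine:

# precision and recall for a grid of candidate cutoffs
cutoffs <- seq(0.3, 0.8, by = 0.05)
metrics <- sapply(cutoffs, function(k) {
  pred_class <- factor(ifelse(rf_pred$.pred_1 > k, "1", "0"),
                       levels = levels(test$is_promoted))
  cm <- caret::confusionMatrix(pred_class, test$is_promoted, positive = "1")
  cm$byClass[c("Precision", "Recall")]
})
data.frame(cutoff = cutoffs, t(metrics))  # pick the cutoff where the two are closest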

library(cmplot)
confmat_plot(prob = rf_pred$.pred_1,ref = test$is_promoted,postarget = "1",negtarget = "0")

From the plot above, a cutoff of 0.6483 looks like the best choice for balancing precision and recall. Let’s see how it looks in the confusion matrix.

rf_pred$class <- as.factor(ifelse(rf_pred$.pred_1 > 0.6483,"1","0"))
cm_rf <- confusionMatrix(rf_pred$class,test$is_promoted,positive = "1")
cm_rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    0
##          1  456  500
##          0  490 9200
##                                           
##                Accuracy : 0.907           
##                  95% CI : (0.9013, 0.9125)
##     No Information Rate : 0.9111          
##     P-Value [Acc > NIR] : 0.9345          
##                                           
##                   Kappa : 0.4284          
##                                           
##  Mcnemar's Test P-Value : 0.7748          
##                                           
##             Sensitivity : 0.48203         
##             Specificity : 0.94845         
##          Pos Pred Value : 0.47699         
##          Neg Pred Value : 0.94943         
##              Prevalence : 0.08886         
##          Detection Rate : 0.04283         
##    Detection Prevalence : 0.08980         
##       Balanced Accuracy : 0.71524         
##                                           
##        'Positive' Class : 1               
## 

We get high accuracy (90.7%) but low precision and sensitivity. The confusion matrix doesn’t look great either, with plenty of false positives and false negatives, but that is still tolerable given the imbalanced target classes.

Let’s try another algorithm; this time we will use XGBoost.

XGBoost

# reuse the same trees and mtry grid for XGBoost tuning
rf.grid <- expand.grid(trees = seq(450,650,50), mtry = 3:7)

xg.setup <- boost_tree(trees = tune(), mtry = tune(),learn_rate = 0.1,tree_depth = 5) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xg.wf <- workflow() %>% 
  add_formula(is_promoted ~ .) %>% 
  add_model(xg.setup)

# xg.tune <- tune_grid(xg.wf,resamples = folds, grid = rf.grid,
#                      metrics = metric_set(accuracy, sens, spec,yardstick::precision,f_meas))
xg.tune <- readRDS("xg_tune.rds")
show_best(xg.tune,metric = "f_meas")

XGBoost with mtry = 7 and 650 trees is the best model based on F1 score, so we will use it as our main XGBoost model.

xg_best <- xg.wf %>% 
  finalize_workflow(select_best(xg.tune,"f_meas")) %>% 
  fit(train_balance)

xg_pred <- predict(xg_best,test,type = "prob")
confmat_plot(prob = xg_pred$.pred_1,ref = test$is_promoted,postarget = "1",negtarget = "0")

From the plot above, a cutoff of 0.6483 again looks like the best choice for balancing precision and recall. Let’s see how it looks in the confusion matrix.

xg_pred$class <- as.factor(ifelse(xg_pred$.pred_1 > 0.6483,"1","0"))
cm_xg <- confusionMatrix(xg_pred$class,test$is_promoted,positive = "1")
cm_xg
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    0
##          1  445  447
##          0  501 9253
##                                           
##                Accuracy : 0.911           
##                  95% CI : (0.9054, 0.9163)
##     No Information Rate : 0.9111          
##     P-Value [Acc > NIR] : 0.53577         
##                                           
##                   Kappa : 0.4355          
##                                           
##  Mcnemar's Test P-Value : 0.08519         
##                                           
##             Sensitivity : 0.47040         
##             Specificity : 0.95392         
##          Pos Pred Value : 0.49888         
##          Neg Pred Value : 0.94864         
##              Prevalence : 0.08886         
##          Detection Rate : 0.04180         
##    Detection Prevalence : 0.08379         
##       Balanced Accuracy : 0.71216         
##                                           
##        'Positive' Class : 1               
## 

We get higher accuracy than the random forest model: 91.1%. Precision and sensitivity are still low; precision is slightly better than the random forest’s, while sensitivity is slightly lower.

Model Evaluation

eval <- data.frame(
   Accuracy = c(cm_rf$overall[1],cm_xg$overall[1]),
   Sensitivity = c(cm_rf$byClass[1],cm_xg$byClass[1]),
   Specificity = c(cm_rf$byClass[2],cm_xg$byClass[2]),
   Precision = c(cm_rf$byClass[5],cm_xg$byClass[5]),
   Recall = c(cm_rf$byClass[6],cm_xg$byClass[6]),
   F1 = c(cm_rf$byClass[7],cm_xg$byClass[7])
) %>% `rownames<-`(c("Random Forest","XGBoost"))

eval