Abstract

We work with the human resources data set available on Kaggle. It is a mock data set, so the results may not be realistic. The idea is to test three different machine learning algorithms for predicting whether a worker will leave. This may not be the best use of the data set, but it is a very clean one on which to run algorithms. Before fitting the models, a basic exploratory analysis is conducted.

Exploratory Analysis

We begin with some basic exploration of the data set.

Load the libraries and the data.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)

myData <- read.csv("HR_comma_sep.csv")

Let’s take a quick look at the data.

str(myData)
## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
myData$left <- as.factor(myData$left)
sum(is.na(myData))
## [1] 0
summary(myData)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident    left      promotion_last_5years
##  Min.   : 2.000     Min.   :0.0000   0:11428   Min.   :0.00000      
##  1st Qu.: 3.000     1st Qu.:0.0000   1: 3571   1st Qu.:0.00000      
##  Median : 3.000     Median :0.0000             Median :0.00000      
##  Mean   : 3.498     Mean   :0.1446             Mean   :0.02127      
##  3rd Qu.: 4.000     3rd Qu.:0.0000             3rd Qu.:0.00000      
##  Max.   :10.000     Max.   :1.0000             Max.   :1.00000      
##                                                                     
##          sales         salary    
##  sales      :4140   high  :1237  
##  technical  :2720   low   :7316  
##  support    :2229   medium:6446  
##  IT         :1227                
##  product_mng: 902                
##  marketing  : 858                
##  (Other)    :2923

The main interest is in the variable “left”, where a 1 indicates that the employee left.

table(myData$left)
## 
##     0     1 
## 11428  3571

We can see that roughly 24% of the workers left.
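
For reference, the exact proportion can be computed directly:

# proportion of workers in each class of "left" (0 = stayed, 1 = left)
prop.table(table(myData$left))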

Let’s take a look at the distribution of a few other variables.

table(myData$sales)
## 
##  accounting          hr          IT  management   marketing product_mng 
##         767         739        1227         630         858         902 
##       RandD       sales     support   technical 
##         787        4140        2229        2720
hist(myData$average_montly_hours, main = "Average Monthly Hours")

Now let’s try to get a visual understanding of how some variables relate to a worker leaving or not.

g <- ggplot(aes(y = satisfaction_level, x = last_evaluation), data = myData)

g + geom_tile(aes(fill = left)) + ggtitle("Satisfaction X Evaluation ")

Here we can see three distinct groups of leavers. The first consists of workers with evaluations below 0.6 and satisfaction below 0.5; these could be low-productivity workers who did not enjoy their work and hence left. The second group is less compact and is composed of high-evaluation, high-satisfaction workers; these could be highly skilled, hard-working employees who moved on to better positions. The final group comprises highly evaluated workers who were extremely unhappy with their work, so it is no surprise they left.

Let’s “zoom in” on these groups and look at their basic statistics.

left_data <- myData[which(myData$left == 1),]
group1 <- left_data[which(left_data$satisfaction_level < 0.5 & left_data$satisfaction_level > 0.25),]
summary(group1)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.260      Min.   :0.4500   Min.   :2.000   Min.   :126         
##  1st Qu.:0.380      1st Qu.:0.4800   1st Qu.:2.000   1st Qu.:135         
##  Median :0.410      Median :0.5100   Median :2.000   Median :145         
##  Mean   :0.406      Mean   :0.5222   Mean   :2.132   Mean   :148         
##  3rd Qu.:0.430      3rd Qu.:0.5500   3rd Qu.:2.000   3rd Qu.:154         
##  Max.   :0.490      Max.   :1.0000   Max.   :7.000   Max.   :310         
##                                                                          
##  time_spend_company Work_accident    left     promotion_last_5years
##  Min.   :2.000      Min.   :0.0000   0:   0   Min.   :0.00000      
##  1st Qu.:3.000      1st Qu.:0.0000   1:1613   1st Qu.:0.00000      
##  Median :3.000      Median :0.0000            Median :0.00000      
##  Mean   :3.049      Mean   :0.0465            Mean   :0.00744      
##  3rd Qu.:3.000      3rd Qu.:0.0000            3rd Qu.:0.00000      
##  Max.   :6.000      Max.   :1.0000            Max.   :1.00000      
##                                                                    
##        sales        salary   
##  sales    :487   high  : 48  
##  technical:285   low   :998  
##  support  :244   medium:567  
##  IT       :111               
##  hr       :109               
##  marketing:107               
##  (Other)  :270
group2 <-  left_data[which(left_data$satisfaction_level > 0.7 & left_data$last_evaluation > 0.8),]
summary(group2)
##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.7100     Min.   :0.8100   Min.   :2.000   Min.   :130.0       
##  1st Qu.:0.7700     1st Qu.:0.8700   1st Qu.:4.000   1st Qu.:232.0       
##  Median :0.8200     Median :0.9200   Median :5.000   Median :245.0       
##  Mean   :0.8194     Mean   :0.9243   Mean   :4.571   Mean   :244.8       
##  3rd Qu.:0.8700     3rd Qu.:0.9800   3rd Qu.:5.000   3rd Qu.:259.0       
##  Max.   :0.9200     Max.   :1.0000   Max.   :6.000   Max.   :307.0       
##                                                                          
##  time_spend_company Work_accident     left    promotion_last_5years
##  Min.   :2.000      Min.   :0.00000   0:  0   Min.   :0.000000     
##  1st Qu.:5.000      1st Qu.:0.00000   1:890   1st Qu.:0.000000     
##  Median :5.000      Median :0.00000           Median :0.000000     
##  Mean   :5.157      Mean   :0.05169           Mean   :0.001124     
##  3rd Qu.:5.000      3rd Qu.:0.00000           3rd Qu.:0.000000     
##  Max.   :6.000      Max.   :1.00000           Max.   :1.000000     
##                                                                    
##          sales        salary   
##  sales      :256   high  : 12  
##  technical  :169   low   :539  
##  support    :149   medium:339  
##  product_mng: 65               
##  IT         : 64               
##  marketing  : 50               
##  (Other)    :137
group3 <-  left_data[which(left_data$satisfaction_level < 0.2 & left_data$last_evaluation > 0.7),]
summary(group3)
##  satisfaction_level last_evaluation number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.730   Min.   :3.000   Min.   :136.0       
##  1st Qu.:0.1000     1st Qu.:0.820   1st Qu.:6.000   1st Qu.:259.0       
##  Median :0.1000     Median :0.870   Median :6.000   Median :278.0       
##  Mean   :0.1025     Mean   :0.872   Mean   :6.184   Mean   :275.8       
##  3rd Qu.:0.1100     3rd Qu.:0.930   3rd Qu.:7.000   3rd Qu.:293.0       
##  Max.   :0.1900     Max.   :1.000   Max.   :7.000   Max.   :310.0       
##                                                                         
##  time_spend_company Work_accident     left    promotion_last_5years
##  Min.   :2.000      Min.   :0.00000   0:  0   Min.   :0.000000     
##  1st Qu.:4.000      1st Qu.:0.00000   1:897   1st Qu.:0.000000     
##  Median :4.000      Median :0.00000           Median :0.000000     
##  Mean   :4.086      Mean   :0.04571           Mean   :0.003344     
##  3rd Qu.:4.000      3rd Qu.:0.00000           3rd Qu.:0.000000     
##  Max.   :5.000      Max.   :1.00000           Max.   :1.000000     
##                                                                    
##         sales        salary   
##  sales     :231   high  : 15  
##  technical :200   low   :536  
##  support   :134   medium:346  
##  IT        : 86               
##  accounting: 55               
##  hr        : 48               
##  (Other)   :143

There is nothing really striking about these groups; their overall distributions are not too different from the original population. One detail worth noting is salary: high earners make up only about 3% of group 1 (and an even smaller share of groups 2 and 3), compared with roughly 8% of the full data set, so the leavers in every group skew towards low and medium salaries.
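
These proportions can be checked directly from the objects defined above:

# share of each salary band among group 1 leavers versus the full data set
prop.table(table(group1$salary))
prop.table(table(myData$salary))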

Next, let’s look at time spent working.

gg <- ggplot(aes(y = time_spend_company, x = average_montly_hours), data = myData)
gg + geom_tile(aes(fill = left)) + ggtitle("Time at Company X Average Monthly Hours")

The takeaway from the last graph is that no worker with more than six years at the company left; every leaver had a tenure of six years or less.
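
This can be verified numerically on the leavers alone:

# longest tenure observed among workers who left
max(left_data$time_spend_company)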

At this point we have already gathered some insights that will be useful for feature engineering. But first, let’s fit a few models to the original data.

Model Fitting 1

We fit a few models to this data. Note that the proportion of workers staying is around 76%, so a model always predicting workers to stay would be right, on average, 76% of the time.

The models used are Logit Boost, learning vector quantization (LVQ) and linear eXtreme Gradient Boosting (XGB). There is not much behind this choice; the idea is to experiment with models other than random forests and other popular choices (although gradient boosting models are themselves extremely popular). Logit Boost seems like an obvious choice for someone coming from a linear regression -> logistic regression path.

It is beyond the scope of this note to detail the workings and tuning of these models. There are many online references describing them and how to fine-tune them.

First we set up the train control in caret.

myCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
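
Note that, as written, the train() calls below do not pass this control object, so caret falls back to its default bootstrap resampling (25 resamples, which is what the summaries below report). To actually use the repeated cross-validation control, it would be passed explicitly; a minimal sketch (fit_LogitBoost_cv is just an illustrative name):

# sketch: passing the repeated-CV control to train (not used in the runs below)
fit_LogitBoost_cv <- train(left~., data = myData, method = "LogitBoost",
                           trControl = myCtrl)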

Logit Boost

set.seed(314)
fit_LogitBoost <- train(left~., data = myData, method = "LogitBoost")
## Loading required package: caTools

LVQ

Since the LVQ algorithm is distance based, the variables will be scaled: caret’s “scale” pre-processing divides each predictor by its standard deviation, which prevents any single attribute from dominating the distance measure.

set.seed(314)
fit_lvq <- train(left~., data = myData, method = "lvq", preProcess = "scale")
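
If one instead wanted to constrain the predictors to the [0, 1] interval, caret’s “range” pre-processing would do that; a minimal variant (fit_lvq_range is just an illustrative name):

# alternative sketch: rescale predictors to [0, 1] instead of unit variance
fit_lvq_range <- train(left~., data = myData, method = "lvq",
                       preProcess = "range")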

XGB

set.seed(314)
fit_xgb <- train(left~., data = myData, method = "xgbLinear")
## Loading required package: xgboost

Model Comparison

resamps <- resamples(list(Logit_Boost = fit_LogitBoost,
                          LVQ = fit_lvq,
                          XGB = fit_xgb))
summary(resamps)
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: Logit_Boost, LVQ, XGB 
## Number of resamples: 25 
## 
## Accuracy 
##                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## Logit_Boost 0.9258590 0.9429929 0.9450368 0.9457732 0.9502262 0.9582437
## LVQ         0.7687500 0.8143270 0.8397551 0.8336614 0.8530317 0.8733007
## XGB         0.9861213 0.9874363 0.9880326 0.9878979 0.9883805 0.9900650
##             NA's
## Logit_Boost    0
## LVQ            0
## XGB            0
## 
## Kappa 
##                  Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## Logit_Boost 0.7946209 0.8396185 0.8502140 0.8498578 0.8617968 0.8856734
## LVQ         0.1484379 0.3923340 0.4858787 0.4567644 0.5413370 0.6130312
## XGB         0.9616299 0.9648085 0.9667856 0.9663863 0.9673814 0.9729447
##             NA's
## Logit_Boost    0
## LVQ            0
## XGB            0

The linear XGB model provides the best fit, with Logit Boost a close second. Although XGB is more accurate, it is somewhat slower than Logit Boost on this data set.
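
The resampling distributions, and the training times behind the speed comment, can also be inspected directly with functions that ship with caret; a quick sketch:

# box-and-whisker comparison of the resampled accuracies
bwplot(resamps, metric = "Accuracy")

# elapsed training time (in seconds) for each fit
fit_LogitBoost$times$everything
fit_xgb$times$everything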

Feature Engineering

Now some new features are created by transforming the original ones, to see how much we can improve accuracy. The insights for the transformations come mostly from the exploratory analysis conducted earlier in this note.

myData$satisfaction_above <- ifelse(myData$satisfaction_level > mean(myData$satisfaction_level), 1,0)
myData$last_evaluation_above <- ifelse(myData$last_evaluation > mean(myData$last_evaluation), 1, 0)
myData$project_time <- myData$number_project/myData$time_spend_company
myData$satisfaction_both <- myData$satisfaction_level*myData$last_evaluation
myData$satisfied_acc <- myData$satisfaction_above*myData$Work_accident
myData$max_time <- ifelse(myData$time_spend_company > 6,1,0)
myData$min_time <- ifelse(myData$time_spend_company <= 2.5, 1, 0)
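
A quick sanity check on the new columns before refitting:

# summary of the engineered features
summary(myData[, c("satisfaction_above", "last_evaluation_above", "project_time",
                   "satisfaction_both", "satisfied_acc", "max_time", "min_time")])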

Model Fitting 2

Logit Boost

set.seed(314)
fit_LogitBoost_FE <- train(left~., data = myData, method = "LogitBoost")

LVQ

set.seed(314)
fit_lvq_FE <- train(left~., data = myData, method = "lvq", preProcess = "scale")

XGB

set.seed(314)
fit_xgb_FE <- train(left~., data = myData, method = "xgbLinear")

Model Comparison

resamps <- resamples(list(Logit_Boost = fit_LogitBoost,
                          Logit_Boost_FE = fit_LogitBoost_FE,
                          LVQ = fit_lvq,
                          LVQ_FE = fit_lvq_FE,
                          XGB = fit_xgb,
                          XGB_FE = fit_xgb_FE))
summary(resamps)
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: Logit_Boost, Logit_Boost_FE, LVQ, LVQ_FE, XGB, XGB_FE 
## Number of resamples: 25 
## 
## Accuracy 
##                     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## Logit_Boost    0.9258590 0.9429929 0.9450368 0.9457732 0.9502262 0.9582437
## Logit_Boost_FE 0.9419530 0.9500452 0.9561356 0.9548343 0.9590958 0.9671243
## LVQ            0.7687500 0.8143270 0.8397551 0.8336614 0.8530317 0.8733007
## LVQ_FE         0.8374552 0.8718978 0.8827799 0.8822741 0.9006335 0.9128676
## XGB            0.9861213 0.9874363 0.9880326 0.9878979 0.9883805 0.9900650
## XGB_FE         0.9857014 0.9868870 0.9883995 0.9880074 0.9889633 0.9902103
##                NA's
## Logit_Boost       0
## Logit_Boost_FE    0
## LVQ               0
## LVQ_FE            0
## XGB               0
## XGB_FE            0
## 
## Kappa 
##                     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## Logit_Boost    0.7946209 0.8396185 0.8502140 0.8498578 0.8617968 0.8856734
## Logit_Boost_FE 0.8438438 0.8594532 0.8775766 0.8745637 0.8865155 0.9090738
## LVQ            0.1484379 0.3923340 0.4858787 0.4567644 0.5413370 0.6130312
## LVQ_FE         0.4891080 0.6052184 0.6778070 0.6584821 0.7239492 0.7580227
## XGB            0.9616299 0.9648085 0.9667856 0.9663863 0.9673814 0.9729447
## XGB_FE         0.9595603 0.9631820 0.9676064 0.9666579 0.9696778 0.9727691
##                NA's
## Logit_Boost       0
## Logit_Boost_FE    0
## LVQ               0
## LVQ_FE            0
## XGB               0
## XGB_FE            0

As can be seen, there was a small improvement for the Logit Boost and LVQ models, while XGB is mostly unaffected by the additional variables.

We can compare the variable importance for the XGB and Logit Boost models. Although we are not doing any form of inference, this helps us understand which variables are more important in predicting whether a worker leaves.

varImp(fit_LogitBoost_FE)
## ROC curve variable importance
## 
##                       Importance
## satisfaction_level       100.000
## satisfaction_both         83.269
## project_time              73.690
## satisfaction_above        70.544
## time_spend_company        69.706
## min_time                  53.003
## Work_accident             25.250
## satisfied_acc             19.058
## average_montly_hours      12.113
## max_time                   9.348
## last_evaluation_above      5.458
## salary                     5.189
## sales                      4.846
## number_project             4.556
## promotion_last_5years      3.578
## last_evaluation            0.000
varImp(fit_xgb_FE)
## Loading required package: xgboost
## xgbLinear variable importance
## 
##   only 20 most important variables shown (out of 25)
## 
##                         Overall
## satisfaction_level    100.00000
## time_spend_company     29.61562
## satisfaction_both      21.70272
## number_project         20.81198
## average_montly_hours   15.84453
## last_evaluation        15.49066
## project_time            8.95667
## salarylow               0.93601
## Work_accident           0.59020
## salestechnical          0.53755
## salessales              0.39689
## salarymedium            0.37481
## salessupport            0.32702
## salesIT                 0.17288
## saleshr                 0.12327
## salesmanagement         0.07128
## salesproduct_mng        0.06756
## promotion_last_5years   0.04877
## salesRandD              0.04458
## salesmarketing          0.02649

As expected, satisfaction level is the most important variable for both models. But Logit Boost gives high importance to the newly created variables, while XGB does not. Note that for the Logit Boost fit caret reports a model-free, ROC-based importance (as the output above indicates), whereas the XGB importance comes from the fitted model itself; this partly explains the different rankings and also hints at how different the models are.
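
The rankings can also be compared visually with caret’s plot method for variable importance; a short sketch:

# plot the ten most important predictors for each model
plot(varImp(fit_LogitBoost_FE), top = 10)
plot(varImp(fit_xgb_FE), top = 10)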

Conclusion

With a few graphs we were able to gain some insights from the data that were not clear at first. By fitting a few models we were able to predict, with high accuracy, whether workers would leave. For these models feature engineering was not extremely helpful, but that does not mean these (or new) transformations would not be useful for different models.