We work with the human resources data available on Kaggle. This is a mock dataset, so results may not be realistic. The idea is to test 3 different machine learning algorithms to predict if a worker will leave or not, this may not be the best use of this dataset, but it’s a very clean one on which to run algorithms. Before ftting the models a basic exploratory analysis is conducted.
We begin by doing some basic exploring on the data set.
Load the libraries and the data.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)
myData <- read.csv("HR_comma_sep.csv")
Let’s take a quick look at the data
str(myData)
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
myData$left <- as.factor(myData$left)
sum(is.na(myData))
## [1] 0
summary(myData)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left promotion_last_5years
## Min. : 2.000 Min. :0.0000 0:11428 Min. :0.00000
## 1st Qu.: 3.000 1st Qu.:0.0000 1: 3571 1st Qu.:0.00000
## Median : 3.000 Median :0.0000 Median :0.00000
## Mean : 3.498 Mean :0.1446 Mean :0.02127
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :10.000 Max. :1.0000 Max. :1.00000
##
## sales salary
## sales :4140 high :1237
## technical :2720 low :7316
## support :2229 medium:6446
## IT :1227
## product_mng: 902
## marketing : 858
## (Other) :2923
The main interest is in the variable “left”, where a 1 indicates that the employee left.
table(myData$left)
##
## 0 1
## 11428 3571
We can see that roughly 30% of the workers left.
Let’s take a look at the distribution of a few other variables.
table(myData$sales)
##
## accounting hr IT management marketing product_mng
## 767 739 1227 630 858 902
## RandD sales support technical
## 787 4140 2229 2720
hist(myData$average_montly_hours, main = "Average Monthly Hours")
Now let’s try to get a visual understanding of how some variables relate to a worker leaving or not.
g <- ggplot(aes(y = satisfaction_level, x = last_evaluation), data = myData)
g + geom_tile(aes(fill = left)) + ggtitle("Satisfaction X Evaluation ")
Here we can see three distinct groups, ones that left with evluation below 0.6 and satisfaction below 0.5, these could be low productivity workers that didn’t enjoy their work and hence left. The second group is a less compact, it is composed of high-evaluation high-satisfaction workers, these could be high skill hard workers that moved on to a better position. The final group comprises highly evaluated workers that were extremly unhappy with work, it’s no surprise they left.
Let’s “zoom” in these groups and look at their basic stastistics.
left_data <- myData[which(myData$left == 1),]
group1<- left_data[which(left_data$satisfaction_level< 0.5 & left_data$satisfaction_level > 0.25),]
summary(group1)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.260 Min. :0.4500 Min. :2.000 Min. :126
## 1st Qu.:0.380 1st Qu.:0.4800 1st Qu.:2.000 1st Qu.:135
## Median :0.410 Median :0.5100 Median :2.000 Median :145
## Mean :0.406 Mean :0.5222 Mean :2.132 Mean :148
## 3rd Qu.:0.430 3rd Qu.:0.5500 3rd Qu.:2.000 3rd Qu.:154
## Max. :0.490 Max. :1.0000 Max. :7.000 Max. :310
##
## time_spend_company Work_accident left promotion_last_5years
## Min. :2.000 Min. :0.0000 0: 0 Min. :0.00000
## 1st Qu.:3.000 1st Qu.:0.0000 1:1613 1st Qu.:0.00000
## Median :3.000 Median :0.0000 Median :0.00000
## Mean :3.049 Mean :0.0465 Mean :0.00744
## 3rd Qu.:3.000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :6.000 Max. :1.0000 Max. :1.00000
##
## sales salary
## sales :487 high : 48
## technical:285 low :998
## support :244 medium:567
## IT :111
## hr :109
## marketing:107
## (Other) :270
group2 <- left_data[which(left_data$satisfaction_level > 0.7 & left_data$last_evaluation > 0.8),]
summary(group2)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.7100 Min. :0.8100 Min. :2.000 Min. :130.0
## 1st Qu.:0.7700 1st Qu.:0.8700 1st Qu.:4.000 1st Qu.:232.0
## Median :0.8200 Median :0.9200 Median :5.000 Median :245.0
## Mean :0.8194 Mean :0.9243 Mean :4.571 Mean :244.8
## 3rd Qu.:0.8700 3rd Qu.:0.9800 3rd Qu.:5.000 3rd Qu.:259.0
## Max. :0.9200 Max. :1.0000 Max. :6.000 Max. :307.0
##
## time_spend_company Work_accident left promotion_last_5years
## Min. :2.000 Min. :0.00000 0: 0 Min. :0.000000
## 1st Qu.:5.000 1st Qu.:0.00000 1:890 1st Qu.:0.000000
## Median :5.000 Median :0.00000 Median :0.000000
## Mean :5.157 Mean :0.05169 Mean :0.001124
## 3rd Qu.:5.000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :6.000 Max. :1.00000 Max. :1.000000
##
## sales salary
## sales :256 high : 12
## technical :169 low :539
## support :149 medium:339
## product_mng: 65
## IT : 64
## marketing : 50
## (Other) :137
group3 <- left_data[which(left_data$satisfaction_level < 0.2 & left_data$last_evaluation > 0.7),]
summary(group3)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.730 Min. :3.000 Min. :136.0
## 1st Qu.:0.1000 1st Qu.:0.820 1st Qu.:6.000 1st Qu.:259.0
## Median :0.1000 Median :0.870 Median :6.000 Median :278.0
## Mean :0.1025 Mean :0.872 Mean :6.184 Mean :275.8
## 3rd Qu.:0.1100 3rd Qu.:0.930 3rd Qu.:7.000 3rd Qu.:293.0
## Max. :0.1900 Max. :1.000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left promotion_last_5years
## Min. :2.000 Min. :0.00000 0: 0 Min. :0.000000
## 1st Qu.:4.000 1st Qu.:0.00000 1:897 1st Qu.:0.000000
## Median :4.000 Median :0.00000 Median :0.000000
## Mean :4.086 Mean :0.04571 Mean :0.003344
## 3rd Qu.:4.000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :5.000 Max. :1.00000 Max. :1.000000
##
## sales salary
## sales :231 high : 15
## technical :200 low :536
## support :134 medium:346
## IT : 86
## accounting: 55
## hr : 48
## (Other) :143
There’s nothing really striking about these groups, their overall distributions are not too different from the original population. It’s noteworthy that in group 1 there’s a higher proportion of high salary earners than in the total population, which is puzzling, as their evaluations are not so good, they are unhappy at work but are more likely to belong to high earners.
Next, let’s look at time spent working.
gg <- ggplot(aes(y = time_spend_company, x = average_montly_hours), data = myData)
gg + geom_tile(aes(fill = left)) + ggtitle("Time at Company X Average Monthly Hours")
The takeaway from the last graph is that no worker left the company after reaching 6 years on the job.
At this point we already gathered some insights that will be useful for feature engineering. But first, let’s fit a few models to the original data.
We fit a few models to this data. Note that the proportion of workers staying is around 76%, so a model always predicting workers to stay would be right, on average, 76% of the time.
The models used are Logistic Boost, LVQ and Linear Extreme Gradient Boost. There’s not much behind this choice, the idea is to experiment with models other than random forest and other popular choices (although Gradient Boost models are extremely popular). Logistic Boost seems like an obvious choice for someone coming from a Linear Regression -> Logistic Regression path.
It’s not in the scope of this note to detail the working and tuning of these models. There are many online references detailing these models and how to fine tune them.
First we set up the train control in caret.
myCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(314)
fit_LogitBoost <- train(left~., data = myData, method = "LogitBoost")
## Loading required package: caTools
Since the LVQ algorithm is distance dependent, the variables will be scaled, so they are constrained between 0 and 1. This is to avoid one attribute from dominating the distance measure.
set.seed(314)
fit_lvq <- train(left~., data = myData, method = "lvq", preProcess = "scale")
set.seed(314)
fit_xgb <- train(left~., data = myData, method = "xgbLinear")
## Loading required package: xgboost
resamps <- resamples(list(Logit_Boost = fit_LogitBoost,
LVQ = fit_lvq,
XGB = fit_xgb))
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: Logit_Boost, LVQ, XGB
## Number of resamples: 25
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Logit_Boost 0.9258590 0.9429929 0.9450368 0.9457732 0.9502262 0.9582437
## LVQ 0.7687500 0.8143270 0.8397551 0.8336614 0.8530317 0.8733007
## XGB 0.9861213 0.9874363 0.9880326 0.9878979 0.9883805 0.9900650
## NA's
## Logit_Boost 0
## LVQ 0
## XGB 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Logit_Boost 0.7946209 0.8396185 0.8502140 0.8498578 0.8617968 0.8856734
## LVQ 0.1484379 0.3923340 0.4858787 0.4567644 0.5413370 0.6130312
## XGB 0.9616299 0.9648085 0.9667856 0.9663863 0.9673814 0.9729447
## NA's
## Logit_Boost 0
## LVQ 0
## XGB 0
The Linear Gradient Boost method provides the best fit, with the Logit Boost as a close second. Although GBM is more accurate, it is somewhat slower than the Logit Boost on this data set.
Now some new features are created by transforming the original ones to see by how much we can improve accuracy. The insights for the transformations come mostly from the exploratiry analysis conducted earlier in this note.
myData$satisfaction_above <- ifelse(myData$satisfaction_level > mean(myData$satisfaction_level), 1,0)
myData$last_evaluation_above <- ifelse(myData$last_evaluation > mean(myData$last_evaluation), 1, 0)
myData$project_time <- myData$number_project/myData$time_spend_company
myData$satisfaction_both <- myData$satisfaction_level*myData$last_evaluation
myData$satisfied_acc <- myData$satisfaction_above*myData$Work_accident
myData$max_time <- ifelse(myData$time_spend_company > 6,1,0)
myData$min_time <- ifelse(myData$time_spend_company <= 2.5, 1, 0)
set.seed(314)
fit_LogitBoost_FE <- train(left~., data = myData, method = "LogitBoost")
set.seed(314)
fit_lvq_FE <- train(left~., data = myData, method = "lvq", preProcess = "scale")
set.seed(314)
fit_xgb_FE <- train(left~., data = myData, method = "xgbLinear")
resamps <- resamples(list(Logit_Boost = fit_LogitBoost,
Logit_Boost_FE = fit_LogitBoost_FE,
LVQ = fit_lvq,
LVQ_FE = fit_lvq_FE,
XGB = fit_xgb,
XGB_FE = fit_xgb_FE))
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: Logit_Boost, Logit_Boost_FE, LVQ, LVQ_FE, XGB, XGB_FE
## Number of resamples: 25
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Logit_Boost 0.9258590 0.9429929 0.9450368 0.9457732 0.9502262 0.9582437
## Logit_Boost_FE 0.9419530 0.9500452 0.9561356 0.9548343 0.9590958 0.9671243
## LVQ 0.7687500 0.8143270 0.8397551 0.8336614 0.8530317 0.8733007
## LVQ_FE 0.8374552 0.8718978 0.8827799 0.8822741 0.9006335 0.9128676
## XGB 0.9861213 0.9874363 0.9880326 0.9878979 0.9883805 0.9900650
## XGB_FE 0.9857014 0.9868870 0.9883995 0.9880074 0.9889633 0.9902103
## NA's
## Logit_Boost 0
## Logit_Boost_FE 0
## LVQ 0
## LVQ_FE 0
## XGB 0
## XGB_FE 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## Logit_Boost 0.7946209 0.8396185 0.8502140 0.8498578 0.8617968 0.8856734
## Logit_Boost_FE 0.8438438 0.8594532 0.8775766 0.8745637 0.8865155 0.9090738
## LVQ 0.1484379 0.3923340 0.4858787 0.4567644 0.5413370 0.6130312
## LVQ_FE 0.4891080 0.6052184 0.6778070 0.6584821 0.7239492 0.7580227
## XGB 0.9616299 0.9648085 0.9667856 0.9663863 0.9673814 0.9729447
## XGB_FE 0.9595603 0.9631820 0.9676064 0.9666579 0.9696778 0.9727691
## NA's
## Logit_Boost 0
## Logit_Boost_FE 0
## LVQ 0
## LVQ_FE 0
## XGB 0
## XGB_FE 0
As can be seen, the was a small improvement for the Logit Boost and LVQ models, while the XGB is mostly unaffected by the additional variables.
We can compare the variable importance for the XGB and Logit Boost model. Although we are not doing any form of inference, this helps us understand which variable are more important in predicting a worker leaveing.
varImp(fit_LogitBoost_FE)
## ROC curve variable importance
##
## Importance
## satisfaction_level 100.000
## satisfaction_both 83.269
## project_time 73.690
## satisfaction_above 70.544
## time_spend_company 69.706
## min_time 53.003
## Work_accident 25.250
## satisfied_acc 19.058
## average_montly_hours 12.113
## max_time 9.348
## last_evaluation_above 5.458
## salary 5.189
## sales 4.846
## number_project 4.556
## promotion_last_5years 3.578
## last_evaluation 0.000
varImp(fit_xgb_FE)
## Loading required package: xgboost
## xgbLinear variable importance
##
## only 20 most important variables shown (out of 25)
##
## Overall
## satisfaction_level 100.00000
## time_spend_company 29.61562
## satisfaction_both 21.70272
## number_project 20.81198
## average_montly_hours 15.84453
## last_evaluation 15.49066
## project_time 8.95667
## salarylow 0.93601
## Work_accident 0.59020
## salestechnical 0.53755
## salessales 0.39689
## salarymedium 0.37481
## salessupport 0.32702
## salesIT 0.17288
## saleshr 0.12327
## salesmanagement 0.07128
## salesproduct_mng 0.06756
## promotion_last_5years 0.04877
## salesRandD 0.04458
## salesmarketing 0.02649
As expected, satisfaction level is the most important variable for both models. But the Logit Boost gives high importance to the newly created variables, while the XGB does not. This also hints at how different the models are.
With a few graphs we were able to gain some insights from the data that were not clear at first. Fitting a few models we were able to get predictions of workers leaving or not with high accuracy. For these models feature engineering was not extremely helpful, but it doesn’t mean that these (or new) transformations wouldn’t be useful for different models.