Abstract

Large companies are constantly looking for good employees and put great effort into recruiting them, but they easily forget that good employees also need “maintenance”.

This dataset contains the features of about 15,000 employees and a label indicating whether each employee left. The company is interested to know:

  * what characterizes the employees who left
  * whether it is possible to identify the employees most likely to leave the company
  * whether we can differentiate the ‘unwanted’ employees from those we want to keep.

About the dataset and the features

The dataset contains 14,999 employee records and 10 variables, one of which is the dependent variable.

The features are:

  1. satisfaction_level - the employee’s level of satisfaction with the job

  2. last_evaluation - the last evaluation score given by the employee’s manager

  3. number_project - how many projects the employee had simultaneously

  4. average_montly_hours - the average number of hours worked per month (the column name is misspelled in the dataset itself)

  5. time_spend_company - how many years the employee has been employed in the company

  6. Work_accident - whether the employee had a work accident

  7. left - the dependent variable, indicating whether or not the employee left

  8. promotion_last_5years - a boolean variable that indicates a promotion in the last 5 years

  9. sales - the department the employee works in (despite its name, this factor has 10 levels, such as accounting, hr and sales)

  10. salary - the salary level of the employee (low, medium or high)

Data Understanding

Descriptive Statistics

univariate and bivariate analysis

First, let’s have a look at the existing features. We want to know their names, their types and what they look like in general.

## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sales                : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...

We can see that we have 10 variables and 14,999 records. We can also see that we will have to reclassify some of the features, such as ‘left’, which should be a factor rather than an int.

Now, let’s look at each variable separately.
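The reclassification mentioned above can be sketched as follows. This is a minimal illustration on a toy data frame, not the original code: the rows are made up, only the column names follow the `str()` output, and the labels "Remain"/"Left" and the ordering of `salary` are assumptions.

```r
# Toy stand-in for the HR data; values are invented for illustration.
hr <- data.frame(
  satisfaction_level = c(0.38, 0.80, 0.11, 0.72),
  left               = c(1L, 1L, 0L, 0L),
  Work_accident      = c(0L, 0L, 1L, 0L),
  salary             = c("low", "medium", "medium", "high"),
  stringsAsFactors   = TRUE
)

# Binary indicators stored as int become two-level factors.
hr$left          <- factor(hr$left, levels = c(0, 1), labels = c("Remain", "Left"))
hr$Work_accident <- factor(hr$Work_accident, levels = c(0, 1), labels = c("No", "Yes"))

# salary has a natural order, so an ordered factor is one reasonable choice.
hr$salary <- factor(hr$salary, levels = c("low", "medium", "high"), ordered = TRUE)

str(hr)
```

Declaring `salary` as ordered lets comparisons such as `hr$salary > "low"` work, which a plain factor would not allow.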

satisfaction level

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0900  0.4400  0.6400  0.6128  0.8200  1.0000

We can see that the range is 0.09 to 1.00. The mean (0.61) and the median (0.64) are close together, and the first and third quartiles lie roughly 0.2 on either side of the median. Because the mean sits slightly below the median, we probably have a mild negative skew.

Let’s see the distribution and density.

We can see that most of the observations lie between 0.4 and 1, with a low frequency at the lower values. We can also notice a high frequency at the very bottom of the scale, close to 0. Perhaps there is general dissatisfaction at work, so we should check whether these near-zero values mean anything.

Let’s break the histogram down by whether the employee left or stayed.

We can see that the employees who stayed have almost the same distribution as before, while the leavers are concentrated at the near-zero values and otherwise spread across the original distribution.

This is an expected outcome, but we can also see highly satisfied workers who leave too. We should investigate further.

Let’s move on to the next feature.
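The skew argument can be checked directly from the printed summary. The sketch below compares mean and median and also computes Bowley’s quartile skewness, a quartile-based measure that is not part of the original analysis and is used here only as an illustration.

```r
# Values copied from the summary() output of satisfaction_level above.
q1 <- 0.44; med <- 0.64; avg <- 0.6128; q3 <- 0.82

# A mean below the median hints at a longer left tail.
mean_minus_median <- avg - med

# Bowley's quartile skewness: negative values indicate negative (left) skew.
bowley <- (q3 + q1 - 2 * med) / (q3 - q1)

c(mean_minus_median = mean_minus_median, bowley = bowley)
```

Both quantities come out negative but small, which is consistent with a mild negative skew rather than a strong one.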

last evaluation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3600  0.5600  0.7200  0.7161  0.8700  1.0000

The range runs from a minimum of 0.36 to a maximum of 1, while the median and the mean are both around 0.72. The first quartile is 0.56, which suggests we should see a left tail. Let’s look at the distribution of last evaluation.

Now we can see that evaluations at the bottom of the scale are very rare (the minimum is 0.36) and that most evaluations fall in the 0.50 to 0.85 zone, where we see the peaks, probably because managers don’t want to rate their workers as drastically good or bad.

Now let’s see how the data is distributed between leavers and remaining employees.

The peaks on the leavers’ side are much more obvious now. This might indicate that leavers tend to be either poorly evaluated employees or employees rated above roughly 0.8. Maybe they feel that their evaluation wasn’t fair. Employees rated between 0.6 and 0.8 are much less likely to leave, perhaps because they believe this reflects their true value and do not look for another job.

subsetting the population

First, we need to define the population of employees we want to keep. Let us say that an employee with an evaluation above 0.82 and at least 3 years in the company is a worker we want to keep. Note that in the summaries above 0.82 is the third quartile of satisfaction_level; the third quartile of last_evaluation is 0.87.

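Let’s check how much data we have for that population. Note that the salary counts printed below sum to 14,999, i.e. the full dataset, whereas the count tables in the later sections (from time spend company onward) each sum to 4,241 records, which appears to be the size of the ‘wanted’ subset.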

##   high    low medium 
##   1237   7316   6446
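The subsetting rule described above could be sketched like this. It is a hedged example on invented rows, not the original code; `eval_cutoff` and `min_years` are hypothetical names for the two thresholds quoted in the text.

```r
# Toy rows using the column names from the str() output; values are invented.
hr <- data.frame(
  last_evaluation    = c(0.53, 0.86, 0.88, 0.87, 0.90, 0.95),
  time_spend_company = c(3L, 6L, 4L, 5L, 2L, 3L),
  left               = c(1L, 1L, 1L, 0L, 0L, 0L)
)

eval_cutoff <- 0.82  # evaluation threshold quoted in the text
min_years   <- 3     # minimum tenure quoted in the text

# Keep highly evaluated employees with enough tenure.
wanted <- subset(hr, last_evaluation > eval_cutoff & time_spend_company >= min_years)

nrow(wanted)        # size of the 'wanted' population
table(wanted$left)  # how many of them stayed (0) or left (1)
```

Keeping the thresholds in named variables makes it easy to rerun the analysis with a different cutoff, for example the actual third quartile of last_evaluation.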

number of projects

First we need to convert the feature to an ordinal factor, call the new feature “number_project_f”, and then check its stats.

##    
##              2          3          4          5          6          7
##   0 0.87730061 0.97327394 0.73250389 0.47996272 0.22632424 0.00000000
##   1 0.12269939 0.02672606 0.26749611 0.52003728 0.77367576 1.00000000
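A table of this shape, where every column sums to 1, can be produced with `prop.table()` over the columns. The following is a small sketch on made-up rows; only the names `number_project`, `number_project_f` and `left` come from the text.

```r
# Invented rows for illustration only.
hr <- data.frame(
  number_project = c(2L, 2L, 3L, 3L, 3L, 4L, 5L, 5L, 6L, 7L),
  left           = c(0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L)
)

# Ordered factor: the levels 2 < 3 < ... < 7 keep their numeric order.
hr$number_project_f <- factor(hr$number_project, ordered = TRUE)

counts <- table(left = hr$left, projects = hr$number_project_f)

# margin = 2 normalises within each column, i.e. within each project count.
shares <- prop.table(counts, margin = 2)
round(shares, 2)
```

Row "1" of `shares` is then the share of leavers for each number of projects.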

So most of the workers have 3 or 4 projects, some have 2 or 5, few have 6 and even fewer have 7. We can also see that the leavers are grouped in two areas: the lowest project count (about 12% of the employees with 2 projects left, against under 3% of those with 3 projects) and, much more strongly, the 5 to 7 project range (52%, 77% and 100% left). This might indicate two things:

  1. An employee with very few projects is probably on his way out, either because the employer does not want to give him any projects or because the employee has no interest in a job with few projects.
  2. Employees tend to leave when they have a heavy workload.

We should do some bivariate analysis (such as number of projects and satisfaction with respect to left), but let’s continue with the univariate analysis for now.

average monthly hours

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    96.0   187.0   235.0   223.5   260.0   310.0

We can see two peaks, one at ~150 hours and the other at ~250 hours, which means that many employees have either a ‘low workload’ or a ‘high workload’. Employees with a normal workload exist, but not in large numbers.

time spend company

##   10    3    4    5    6    7    8 
##   74 1536 1224  950  349   52   56
##    
##             10          3          4          5          6          7
##   0 1.00000000 0.98372396 0.46813725 0.20421053 0.44412607 1.00000000
##   1 0.00000000 0.01627604 0.53186275 0.79578947 0.55587393 0.00000000
##    
##              8
##   0 1.00000000
##   1 0.00000000

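We can see that the leavers are concentrated among employees with 4 to 6 years in the company (about 53%, 80% and 56% of them left, respectively). At 3 years fewer than 2% left, and from 7 years onward we don’t see any leavers.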

work accidents

##    0    1 
## 3694  547
##    
##             0         1
##   0 0.5801299 0.8628885
##   1 0.4198701 0.1371115

Employees who had a work accident are clearly less likely to leave: about 14% of them left, compared with about 42% of those without an accident. This might be because the employer can’t or doesn’t want to fire an employee who had an accident at work.

promotion last 5 years

##    0    1 
## 4161   80
##    
##             0         1
##   0 0.6094689 0.9875000
##   1 0.3905311 0.0125000

Since there are only 80 promoted employees, we can’t conclude much about this feature’s influence on the outcome for employees who were not promoted. However, just over 1% of the promoted employees left, so a promotion might mean the employee won’t leave soon.

Sales

##  accounting          hr          IT  management   marketing product_mng 
##         228         195         353         167         239         255 
##       RandD       sales     support   technical 
##         186        1140         674         804
##    
##     accounting        hr        IT management marketing product_mng
##   0  0.6359649 0.6000000 0.6203966  0.7485030 0.6527197   0.6039216
##   1  0.3640351 0.4000000 0.3796034  0.2514970 0.3472803   0.3960784
##    
##         RandD     sales   support technical
##   0 0.6881720 0.6122807 0.5979228 0.5845771
##   1 0.3118280 0.3877193 0.4020772 0.4154229
##    
##     accounting  hr  IT management marketing product_mng RandD sales
##   0        145 117 219        125       156         154   128   698
##   1         83  78 134         42        83         101    58   442
##    
##     support technical
##   0     403       470
##   1     271       334

We can see that almost all of the departments have roughly 60% remainers and 40% leavers. The exceptions are management and R&D, where there are noticeably fewer leavers (about 25% and 31%).
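The per-department leave rates can be recomputed from the count table printed above. The sketch below copies those counts by hand; the vector names `stayed` and `left` are just illustrative.

```r
# Counts copied from the department table above (row 0 = stayed, row 1 = left).
dept   <- c("accounting", "hr", "IT", "management", "marketing",
            "product_mng", "RandD", "sales", "support", "technical")
stayed <- c(145, 117, 219, 125, 156, 154, 128, 698, 403, 470)
left   <- c( 83,  78, 134,  42,  83, 101,  58, 442, 271, 334)

# Share of leavers within each department, sorted from lowest to highest.
leave_rate <- setNames(left / (stayed + left), dept)
sort(round(leave_rate, 3))

# Overall share of leavers, as a baseline for comparison.
overall <- sum(left) / sum(stayed + left)
round(overall, 3)
```

Sorting the rates makes the two outliers easy to spot against the overall baseline.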

Salary

##   high    low medium 
##    268   2176   1797
##    
##           high        low     medium
##   0 0.90298507 0.54917279 0.65553701
##   1 0.09701493 0.45082721 0.34446299
##    
##     high  low medium
##   0  242 1195   1178
##   1   26  981    619

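The behaviour here is quite intuitive: the less you are paid, the more likely you are to leave (about 45% of the low-salary, 34% of the medium-salary and 10% of the high-salary employees left).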

Final Plots and Summary

OK, now that we understand the basic behaviour of the data, we will try to get some more insights using multivariate analysis.

Until now we have noticed that the satisfaction level and the last evaluation both show a two-peak behaviour, so it might be interesting to see how they interact with respect to leavers and non-leavers.

This is for the whole population:

and this is for the ‘wanted’ employees:

We can easily see in the blue graph the employees that we don’t want to keep (their last evaluation is low and they are also not satisfied with their work). Obviously, it is best for both the company and the employee that those employees leave.

The next group is highly evaluated but not satisfied. We might want to keep them, but we need to understand why they aren’t satisfied.

The third group is highly evaluated and highly satisfied. We should understand why they want to leave.

To understand this behaviour, we can try using a decision tree.

other variables

Since the dataset is very tidy and focused, there are no other variables we can create.

Decision Tree

We can also see the obvious behaviour of the remaining employees.

## [1] "auc"
## [1] 0.9713122
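The report prints the AUC but not how it was obtained. As a minimal sketch of what the number means, the AUC can be computed in base R from the ranks of the predicted scores (the Mann-Whitney formulation); the function name `auc_rank` and the toy labels and scores are hypothetical.

```r
# AUC = probability that a random leaver gets a higher score than a random stayer.
auc_rank <- function(label, score) {
  r  <- rank(score)        # average ranks, so ties count as one half
  n1 <- sum(label == 1)    # number of positives (leavers)
  n0 <- sum(label == 0)    # number of negatives (stayers)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: one of the four stayer/leaver pairs is ranked the wrong way round.
label <- c(0, 0, 1, 1)
score <- c(0.10, 0.40, 0.35, 0.80)
auc_rank(label, score)
```

A value of 0.5 corresponds to random ranking and 1 to a perfect separation of leavers from stayers.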

So, what explains why the ‘best’ employees leave? We can see that:

If you have been in the company for a long time (more than 3.5 years) and carry a heavy workload (more than 216 hours and more than 3.5 projects), you will probably leave (83%).

That means that the satisfaction level is not that important for the good, long-serving employees; most of them will leave simply because of the high workload.

Let’s test the same tree on the whole population:
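The fitting code for the tree is not shown, so the following is only a sketch of how such a tree could be grown. It assumes the `rpart` package (the original may have used something else) and uses synthetic data whose label is built from the rule quoted above, so the fitted splits only illustrate the mechanics.

```r
library(rpart)  # assumption: a CART-style tree; the original package is not stated

set.seed(1)
n <- 400
# Synthetic employees; only the column names follow the dataset.
toy <- data.frame(
  time_spend_company   = sample(2:8, n, replace = TRUE),
  average_montly_hours = sample(100:300, n, replace = TRUE),
  number_project       = sample(2:7, n, replace = TRUE)
)

# Synthetic label that mimics the rule described in the text.
overworked <- toy$time_spend_company > 3.5 &
  toy$average_montly_hours > 216 &
  toy$number_project > 3.5
toy$left <- factor(ifelse(overworked, "Left", "Remain"))

# Classification tree on all three predictors.
fit <- rpart(left ~ ., data = toy, method = "class")
print(fit)

# In-sample accuracy of the fitted tree.
pred <- predict(fit, toy, type = "class")
mean(pred == toy$left)
```

On real data the tree should be evaluated on held-out records rather than on the rows it was trained on.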

## [1] "auc"
## [1] 0.9712321

Reflection

We can’t jump to conclusions about the influence of satisfaction on leaving the company. Since the satisfaction level is not always available, and since it already captures some degree of a ‘want to leave’ attitude, it might be a mistake to include it in the formula. We can also see that when we take only the highly evaluated employees, the satisfaction level is not so important: the workload explains the reasons for leaving much better than the satisfaction level does.

At first sight it seemed that the satisfaction level was the primary feature, but as soon as we use the decision tree it becomes very clear that the workload (hours and projects) is the strongest variable and has the greatest influence on leaving.

##         pred
##          Remain Left
##   Remain   3735   45
##   Left      104 1116
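The confusion matrix above can be summarised with a few standard metrics. The sketch assumes that rows are the actual classes and columns the predicted ones (only the columns are labelled `pred` in the output), and treats "Left" as the positive class.

```r
# Counts copied from the confusion matrix above.
cm <- matrix(c(3735,   45,
                104, 1116),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("Remain", "Left"),
                             pred   = c("Remain", "Left")))

accuracy  <- sum(diag(cm)) / sum(cm)                    # share of correct predictions
precision <- cm["Left", "Left"] / sum(cm[, "Left"])     # predicted leavers who really left
recall    <- cm["Left", "Left"] / sum(cm["Left", ])     # actual leavers that were caught

round(c(accuracy = accuracy, precision = precision, recall = recall), 3)
```

Recall is the most relevant of the three for the question raised in the abstract, since it measures how many of the employees who actually left were flagged in advance.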