Large companies are constantly looking for good employees, and they do their best when recruiting. They easily forget, however, that good employees also need “maintenance”.
This data set contains features for about 15,000 employees and a label indicating whether each employee left. The company is interested in knowing:
* what characterizes the employees who left
* whether it is possible to identify the employees most likely to leave
* whether we can differentiate the ‘unwanted’ employees from those we want to keep.
The data set contains 14,999 employee records and 10 variables, one of which is the dependent variable.
The features are:
1. satisfaction_level - the employee’s level of satisfaction with the job
2. last_evaluation - the most recent evaluation score given by the employee’s manager
3. number_project - how many projects the employee worked on simultaneously
4. average_montly_hours - the average number of hours worked per month
5. time_spend_company - how many years the employee has been with the company
6. Work_accident - whether the employee had a work accident
7. left - the dependent variable, indicating whether the employee left
8. promotion_last_5years - a boolean variable indicating a promotion in the last 5 years
9. sales - the department the employee works in
10. salary - the salary level of the employee (low, medium, or high)
First, let’s have a look at the existing features. We want to know their names, their types, and what they look like in general.
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ time_spend_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : int 1 1 1 1 1 1 1 1 1 1 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ sales : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
We can see that we have 10 variables and 14,999 records. We can also see that we will have to reclassify some of the features, such as ‘left’, which should be a factor rather than an int.
Now, let’s look at each variable separately.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0900 0.4400 0.6400 0.6128 0.8200 1.0000
We can see that the range is 0.09 to 1.00, and that the mean (0.61) and the median (0.64) are close to each other. The first and third quartiles (0.44 and 0.82) each lie about 0.2 from the median. Because the mean is slightly below the median, we probably have a mild negative skew.
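As a quick check of the skew reading, here is a small Python sketch that applies a quartile-based (Bowley) skewness measure to the summary values printed above. The choice of this particular measure and the function name `bowley_skewness` are assumptions made for illustration; they are not part of the original analysis.

```python
# Summary values for satisfaction_level, copied from the output above.
q1, median, q3 = 0.44, 0.64, 0.82
mean = 0.6128


def bowley_skewness(q1, median, q3):
    """Quartile skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1).

    A negative value means the lower half of the data is more spread out
    than the upper half, i.e. a longer left tail.
    """
    return (q3 + q1 - 2 * median) / (q3 - q1)


skew = bowley_skewness(q1, median, q3)
mean_below_median = mean < median

print(f"Bowley skewness: {skew:.3f}")
print(f"Mean below median: {mean_below_median}")
```

Both indicators point the same way, but the quartile measure is close to zero, so the skew is slight at most.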
Let’s look at the distribution and density.
We can see that most of the observations lie between 0.4 and 1, with a low frequency at the lower values. We can also notice a high frequency close to 0 (around 0.1). Perhaps there is some general dissatisfaction at work, so we should ask whether these near-zero values mean anything.
Let’s split the histogram by whether the employee left or stayed.
We can see that the employees who stayed have almost the same distribution as before, while the leavers are concentrated at the near-zero values and otherwise follow the original distribution.
This is an expected outcome, but we can also see highly satisfied workers leaving too. We should investigate further.
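The split described above amounts to grouping the satisfaction values by the `left` label and binning each group separately. The Python sketch below shows the idea on a handful of made-up records; the data, the 20-point bin width, and the helper name `histogram` are all hypothetical and only illustrate the mechanics.

```python
from collections import Counter


def histogram(values, bin_width_pct=20):
    """Count values in [0, 1] per bin; each bin is keyed by its lower edge in percent."""
    counts = Counter()
    last_bin = 100 // bin_width_pct - 1
    for v in values:
        pct = round(v * 100)                       # work in whole percent to avoid float edge cases
        idx = min(pct // bin_width_pct, last_bin)  # a value of exactly 1.0 goes into the last bin
        counts[idx * bin_width_pct] += 1
    return dict(sorted(counts.items()))


# Hypothetical (satisfaction_level, left) pairs, not taken from the data set.
records = [
    (0.10, 1), (0.11, 1), (0.38, 1), (0.41, 1), (0.80, 1), (0.92, 1),
    (0.55, 0), (0.66, 0), (0.72, 0), (0.95, 0),
]

by_group = {0: [], 1: []}
for satisfaction, left in records:
    by_group[left].append(satisfaction)

hist_left = histogram(by_group[1])
hist_stayed = histogram(by_group[0])
print("left:  ", hist_left)
print("stayed:", hist_stayed)
```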
Let’s move on to the next feature.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3600 0.5600 0.7200 0.7161 0.8700 1.0000
The range runs from a minimum of 0.36 to a maximum of 1, and the median and the mean are both around 0.7. The first quartile is 0.56, which means we should see a left tail. Let’s look at the distribution of last_evaluation.
Now we can see that evaluations at the low end of the scale are rare and that most workers are evaluated between 0.50 and 0.85 (where we see the peaks), probably because managers don’t want to rate their workers as drastically good or bad.
Now let’s see how the data is distributed between leavers and remaining employees.
The peaks on the leavers’ side are much more obvious now. This might indicate that leavers tend to be either poorly evaluated employees or employees evaluated at around 80%. Maybe they feel that their evaluation wasn’t fair. Employees evaluated between 0.6 and 0.8 are much less likely to leave, maybe because they believe the score reflects their true value and so do not look for another job.
First, we need to define the population of employees we want to keep. Let us say that an employee with an evaluation score above 82% (described here as the third quartile, although the summary above reports a third quartile of 0.87 for last_evaluation) and at least 3 years at the company is a worker that we want to keep.
Let’s check how many records we have for that population.
## high low medium
## 1237 7316 6446
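The selection rule for the ‘wanted’ population (evaluation above 0.82 and at least 3 years at the company) can be sketched in Python as a simple filter. The four records and the helper name `is_wanted` are hypothetical; only the two thresholds come from the text.

```python
# Hypothetical employee records using the data set's column names.
employees = [
    {"last_evaluation": 0.86, "time_spend_company": 6, "left": 1},
    {"last_evaluation": 0.53, "time_spend_company": 3, "left": 1},
    {"last_evaluation": 0.90, "time_spend_company": 2, "left": 0},
    {"last_evaluation": 0.85, "time_spend_company": 3, "left": 0},
]


def is_wanted(employee, min_evaluation=0.82, min_years=3):
    """True if the evaluation is above the cutoff and tenure is at least min_years."""
    return (
        employee["last_evaluation"] > min_evaluation
        and employee["time_spend_company"] >= min_years
    )


wanted = [e for e in employees if is_wanted(e)]
print(f"{len(wanted)} of {len(employees)} employees are in the 'wanted' group")
```

The third record shows why both conditions matter: a high evaluation alone is not enough if the employee has been at the company for less than 3 years.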
First, we need to convert the feature to an ordinal factor, which we call “number_project_f”, and then we will check its statistics.
##
## 2 3 4 5 6 7
## 0 0.87730061 0.97327394 0.73250389 0.47996272 0.22632424 0.00000000
## 1 0.12269939 0.02672606 0.26749611 0.52003728 0.77367576 1.00000000
So most of the workers have 3 or 4 projects, some have 2 or 5, few have 6, and even fewer have 7. We can also see that the leavers fall into two groups: those with the lowest number of projects and those with 5 to 7 projects, where they make up a relatively large share. This might indicate two things.
We should do some bivariate analysis (such as number of projects and satisfaction with respect to left), but let’s continue with the univariate analysis.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 96.0 187.0 235.0 223.5 260.0 310.0
We can see two peaks, one at ~150 hours and the other at ~250 hours, which means that we have many ‘low workload’ and ‘high workload’ employees. Employees with a normal workload exist, but not in large numbers.
## 10 3 4 5 6 7 8
## 74 1536 1224 950 349 52 56
##
## 10 3 4 5 6 7
## 0 1.00000000 0.98372396 0.46813725 0.20421053 0.44412607 1.00000000
## 1 0.00000000 0.01627604 0.53186275 0.79578947 0.55587393 0.00000000
##
## 8
## 0 1.00000000
## 1 0.00000000
We can see that the leavers have mostly spent 4 to 6 years at the company (fewer than 2% of the employees with 3 years left). Beyond 6 years we don’t see any leavers.
## 0 1
## 3694 547
##
## 0 1
## 0 0.5801299 0.8628885
## 1 0.4198701 0.1371115
It’s pretty obvious that employees who had a work accident left much less often (about 14%) than those who did not (about 42%). This might be because the employer can’t or doesn’t want to fire an employee who had an accident at work.
## 0 1
## 4161 80
##
## 0 1
## 0 0.6094689 0.9875000
## 1 0.3905311 0.0125000
Since there are only 80 promoted employees, we can’t conclude much about how this feature influences the outcome when the employee was not promoted. However, if the employee was promoted, it might mean that he won’t leave soon.
## accounting hr IT management marketing product_mng
## 228 195 353 167 239 255
## RandD sales support technical
## 186 1140 674 804
##
## accounting hr IT management marketing product_mng
## 0 0.6359649 0.6000000 0.6203966 0.7485030 0.6527197 0.6039216
## 1 0.3640351 0.4000000 0.3796034 0.2514970 0.3472803 0.3960784
##
## RandD sales support technical
## 0 0.6881720 0.6122807 0.5979228 0.5845771
## 1 0.3118280 0.3877193 0.4020772 0.4154229
##
## accounting hr IT management marketing product_mng RandD sales
## 0 145 117 219 125 156 154 128 698
## 1 83 78 134 42 83 101 58 442
##
## support technical
## 0 403 470
## 1 271 334
We can see that almost all of the departments have roughly 60% remainers and 40% leavers. The exceptions are the management and R&D departments, where there are far fewer leavers.
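The proportion table is just each column of the count table divided by its column total. As a Python sketch, using the counts printed above (the dictionary layout and variable names are illustrative choices):

```python
# Counts per department, copied from the table above (row 0 = stayed, row 1 = left).
stayed = {"accounting": 145, "hr": 117, "IT": 219, "management": 125,
          "marketing": 156, "product_mng": 154, "RandD": 128, "sales": 698,
          "support": 403, "technical": 470}
left = {"accounting": 83, "hr": 78, "IT": 134, "management": 42,
        "marketing": 83, "product_mng": 101, "RandD": 58, "sales": 442,
        "support": 271, "technical": 334}

# Share of leavers within each department (column-wise proportion).
leave_rate = {dept: left[dept] / (left[dept] + stayed[dept]) for dept in stayed}

# Print departments from the lowest to the highest share of leavers.
for dept, rate in sorted(leave_rate.items(), key=lambda item: item[1]):
    print(f"{dept:12s} {rate:.3f}")
```

Sorting by the leave rate puts management and RandD at the top of the list, which matches the reading of the table.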
## high low medium
## 268 2176 1797
##
## high low medium
## 0 0.90298507 0.54917279 0.65553701
## 1 0.09701493 0.45082721 0.34446299
##
## high low medium
## 0 242 1195 1178
## 1 26 981 619
The behaviour here is quite intuitive: the less you are paid, the more you want to leave.
OK, now that we understand the basic behaviour of the data, we will try to get some more insights using multivariate analysis.
Until now we have noticed that satisfaction level and last evaluation both show two-peak behaviour, so it might be interesting to see how they interact with respect to leavers and non-leavers.
This is for the whole population:
And this is for the ‘wanted’ employees:
We can easily see in the blue graph the employees that we don’t want to keep (their last evaluation is low and they are also not satisfied with their work). Obviously, it is best for both the company and the employees themselves that these employees leave.
The next group is the highly evaluated employees who aren’t satisfied. We might want to keep them, but we need to understand why they aren’t satisfied.
The third group is the highly evaluated and highly satisfied employees. We should understand why they want to leave.
In order to understand this behaviour, we can try to use a decision tree.
Since the dataset is very tidy and focused, there are no other variables we can create.
We can also see the obvious behaviour of the remaining employees.
## [1] "auc"
## [1] 0.9713122
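For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen leaver receives a higher predicted score than a randomly chosen remainer. The Python sketch below computes it that way by comparing every leaver/remainer pair. The labels and scores are made up for illustration, and this pairwise approach is not necessarily how the value above was computed.

```python
def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = left) and predicted probabilities of leaving.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.5, 0.1]

example_auc = auc(labels, scores)
print(f"AUC: {example_auc:.3f}")
```

In this toy example, 8 of the 9 leaver/remainer pairs are ordered correctly; the one miss is the leaver scored 0.5 against the remainer scored 0.6.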
So, what explains why the ‘best’ employees leave? We can see that:
If you have been at the company for a long time (more than 3.5 years) and carry a heavy workload (more than 216 hours and more than 3.5 projects), you will probably leave (83%).
That means that the satisfaction level is not that important for the good and ‘old’ employees, and most of them will leave simply because of the high workload.
Let’s test the same tree on the whole population:
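The branch described above can be written as a plain rule. This Python sketch encodes only that one branch, with the thresholds quoted in the text; it is not the full fitted tree, and the function name `in_high_risk_branch` is a hypothetical label.

```python
def in_high_risk_branch(time_spend_company, average_montly_hours, number_project):
    """True if an employee falls in the long-tenure, high-workload branch.

    Thresholds are the split points quoted in the text: more than 3.5 years,
    more than 216 monthly hours, and more than 3.5 projects.
    """
    return (
        time_spend_company > 3.5
        and average_montly_hours > 216
        and number_project > 3.5
    )


# Two hypothetical employees from the 'wanted' group.
veteran_overloaded = in_high_risk_branch(5, 250, 5)
newer_overloaded = in_high_risk_branch(3, 250, 5)

print(veteran_overloaded, newer_overloaded)
```

Note that satisfaction_level does not appear anywhere in this rule, which is the point being made about the ‘wanted’ employees.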
## [1] "auc"
## [1] 0.9712321
We can’t jump to conclusions about the influence of satisfaction on leaving the company. However, since the satisfaction level is not always available and it already reflects some degree of ‘want to leave’ attitude, it might be a mistake to include it in the formula. We can also see that when we take just the highly evaluated employees, the satisfaction level is not so important: the workload predicts the decision to leave much better than the satisfaction level does.
At first sight it seemed that the satisfaction level was the primary feature, but as soon as we use the decision tree it becomes very clear that the workload (hours and projects) is the strongest variable and has the greatest influence on leaving.
## pred
## Remain Left
## Remain 3735 45
## Left 104 1116
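The usual summary metrics can be derived from this confusion matrix with a few lines of arithmetic. The Python sketch below assumes that the rows are the actual classes and the columns (labelled `pred`) are the predicted classes; the output does not label the rows, so that orientation is an assumption.

```python
# Cells of the confusion matrix above, assuming rows = actual, columns = predicted.
true_remain = 3735   # actual Remain, predicted Remain
false_left = 45      # actual Remain, predicted Left
false_remain = 104   # actual Left, predicted Remain
true_left = 1116     # actual Left, predicted Left

total = true_remain + false_left + false_remain + true_left

accuracy = (true_remain + true_left) / total
precision_left = true_left / (true_left + false_left)   # how many predicted leavers really left
recall_left = true_left / (true_left + false_remain)    # how many actual leavers were caught

print(f"records:   {total}")
print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision_left:.4f}")
print(f"recall:    {recall_left:.4f}")
```

Under that assumption, the model misses 104 of the 1,220 leavers, while only 45 of the employees it flags as leavers actually stayed.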