Dataset taken from https://www.kaggle.com/ludobenistant/hr-analytics/data
Our challenge is to use a machine learning method, decision trees, to find some possible reasons why employees are leaving.
## Parsed with column specification:
## cols(
## satisfaction_level = col_double(),
## last_evaluation = col_double(),
## number_project = col_integer(),
## average_montly_hours = col_integer(),
## time_spend_company = col_integer(),
## Work_accident = col_integer(),
## left = col_integer(),
## promotion_last_5years = col_integer(),
## sales = col_character(),
## salary = col_character()
## )
## Observations: 14,999
## Variables: 10
## $ satisfaction_level <dbl> 0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10...
## $ last_evaluation <dbl> 0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77...
## $ number_project <int> 2, 5, 7, 5, 2, 2, 6, 5, 5, 2, 2, 6, 4, 2...
## $ average_montly_hours <int> 157, 262, 272, 223, 159, 153, 247, 259, ...
## $ time_spend_company <int> 3, 6, 4, 5, 3, 3, 4, 5, 5, 3, 3, 4, 5, 3...
## $ Work_accident <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ left <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ promotion_last_5years <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sales <chr> "sales", "sales", "sales", "sales", "sal...
## $ salary <chr> "low", "medium", "medium", "low", "low",...
satisfaction_level: A numeric score, probably self-reported by the employee.
last_evaluation: A numeric score, probably given by the employee’s manager.
number_project: An integer, the number of projects the employee has been involved in.
average_montly_hours: The number of hours the employee works (bills) in a month. The misspelling is part of the dataset’s column name.
time_spend_company: An integer, the employee’s length of service.
Work_accident: A boolean, perhaps whether or not the employee had an accident.
left: Looks like a boolean, whether the employee left or not.
promotion_last_5years: Looks like a boolean, presumably whether the employee was promoted in the last 5 years.
sales: Perhaps the department (team) the employee is assigned to.
salary: A 3-level pay grade indicator (low, medium, high).
ggplot(salary, aes(x=reorder(salary, Total), Total)) +
geom_bar(fill = "#c91f01", stat = "identity") + xlab("Salary Levels") +
geom_text(aes(label=Total), vjust=1.6, color="white",
position = position_dodge(0.9), size=3.5) + mytheme()
Only a small number of people are in the high pay grade. Interestingly, there is only a small difference between the number of employees in the medium and low pay grades.
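The plots in this part of the post read from small summary tables (salary, sales, by_salary_sale) whose construction isn’t shown. Here is a minimal sketch of how such tables could be built with dplyr, assuming each one is a simple head count stored in a Total column. The hr_toy data frame is a made-up stand-in for hr, not real data.

```r
library(dplyr)

# Toy stand-in for the `hr` data frame (illustrative values only)
hr_toy <- tibble(
  salary = c("low", "low", "medium", "high", "medium", "low"),
  sales  = c("sales", "support", "sales", "IT", "technical", "sales")
)

# Assumed construction: one row per salary level with its head count
salary <- hr_toy %>% count(salary, name = "Total")

# Same idea per department
sales <- hr_toy %>% count(sales, name = "Total")

# One row per department/salary pair, as the faceted plot would need
by_salary_sale <- hr_toy %>% count(sales, salary, name = "Total")

print(by_salary_sale)
```

Swapping hr_toy for the real hr should give tables with the column names the ggplot calls expect, if the original tables were indeed plain counts.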
ggplot(sales, aes(x=reorder(sales, Total), Total)) +
geom_bar(fill = "blue", stat = "identity") + xlab("Company Departments") +
geom_text(aes(label=Total), vjust=1.6, color="white",
position = position_dodge(0.9), size=3.5) + mytheme()
ggplot(by_salary_sale, aes(x=reorder(sales, Total), Total)) +
geom_bar(fill = "red", stat = "identity") + xlab("Salary Distribution Per Department") +
geom_text(aes(label=Total), vjust=1.6, color="white",
position = position_dodge(0.9), size=3.5) + mytheme() + facet_wrap(~ salary)
It looks like the high pay grade bars show the employees in each department’s top management. The low pay grade bars show the main workforce of each department, as they contain the highest number of workers. The medium pay grade bars show the employees in mid-level supervisory roles.
ggplot(hr, aes(factor(promotion_last_5years))) +
geom_bar(fill = "brown") +
mytheme() +
xlab("Promoted during last 5 years")
ggplot(by_promote_sales, aes(x=reorder(promotion_last_5years, Total), Total)) +
geom_bar(fill = "maroon", stat = "identity") + xlab("Employees Promoted Per Department for last 5 years") +
geom_text(aes(label=Total), vjust=1.6, color="white",
position = position_dodge(0.9), size=3.5) + mytheme() + facet_wrap(~ sales)
Speaking of promotions! We can see that very few employees have been promoted. No employees were promoted in the product management department!
ggplot(year_spend, aes(x = time_spend_company, y = employee_count, color = sales)) +
  geom_line() + geom_point() +
  xlab("Year") + ylab("No of Employees") + theme_classic()
We can see a downward trend after 3 years in the company. Most employees seem to work for about 3 years and then leave. As employees leave, the line drops each year. Only a few stay, perhaps employees from the company’s top management.
ggplot(hr, aes(salary, average_montly_hours, fill = salary)) +
geom_bar(stat = "identity") + xlab("Pay Grade, Average work-hrs/month and Teams") +
mytheme() + facet_grid(~sales) +scale_fill_manual(values = c("#593420","#080c26","#7d8228"))
We can see that the sales team (low and medium pay grades) works the most hours, followed by the support team (low and medium pay grades) and then the technical team (low and medium pay grades). High pay grades, across all teams, tend to work the fewest hours.
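One thing worth double-checking here: geom_bar(stat = "identity") stacks every row, so each bar’s height is the sum of hours in that group and grows with head count. A minimal sketch of computing the per-employee mean alongside the total, so the two can be compared; hr_toy and its values are made up for illustration.

```r
library(dplyr)

# Toy stand-in for `hr` (illustrative values only)
hr_toy <- tibble(
  sales  = c("sales", "sales", "sales", "support", "support"),
  salary = c("low", "low", "high", "low", "medium"),
  average_montly_hours = c(200, 220, 150, 180, 210)
)

# Total hours (what the stacked bars show) versus mean hours per employee
hours_summary <- hr_toy %>%
  group_by(sales, salary) %>%
  summarise(
    total_hours = sum(average_montly_hours),
    mean_hours  = mean(average_montly_hours),
    .groups = "drop"
  )

print(hours_summary)
```

Plotting mean_hours instead of the raw column would show workload per person rather than team size times workload.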
ggplot(sat, aes(satisfaction_level)) +
geom_density(fill = "red") +
facet_wrap( ~ left, ncol = 2)
A. Employees who don’t “like” the company have satisfaction < 0.2
B. Employees who are “not satisfied” with the company have satisfaction < 0.5
C. Good employees who leave the company to seek “greener pastures” have satisfaction > 0.6
ggplot(rate, aes(last_evaluation)) +
geom_density(fill = "red") +
facet_wrap( ~ left, ncol = 2)
That’s amazing! The picture is pretty similar to the one for satisfaction. Perhaps the employees who leave are either:
Likely worse at their job. Note that the lowest end of the scale is 0.4.
Very good at their job, and probably leaving the company for better opportunities.
First we will split our data into training and test sets.
n <- nrow(hr)
# Row indices for a random 66% training sample
idx <- sample(n, floor(n * .66))
# Recode the target and the pay grade as factors before splitting
tree_model <- hr %>%
  mutate(
    left = factor(left, labels = c("Remain", "Left")),
    salary = ordered(salary, c("low", "medium", "high"))
  )
train <- tree_model[idx, ]
test <- tree_model[-idx, ]
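Because the split is random, the AUC values reported further down will change slightly from run to run. A minimal sketch of a reproducible split, assuming you are happy to fix a seed; the seed value 42 and the toy data frame are arbitrary choices, not from the original analysis.

```r
set.seed(42)  # arbitrary seed (an assumption), makes sample() repeatable

# Toy stand-in for the modelling data frame
toy <- data.frame(id = seq_len(100))

n <- nrow(toy)
idx <- sample(n, floor(n * 0.66))    # 66% of rows for training

train <- toy[idx, , drop = FALSE]    # sampled rows
test  <- toy[-idx, , drop = FALSE]   # everything else

cat("train rows:", nrow(train), " test rows:", nrow(test), "\n")
```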
Then we’ll train a single decision tree using rpart and evaluate it to see how good our fit is.
tree <- rpart(left ~ ., data = train)
res <- predict(tree, test)
auc(as.numeric(test$left) - 1, res[, 2])
## [1] 0.9706983
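The post doesn’t say which package provides auc(). To make the number less of a black box, here is a base R sketch of one standard way to compute it, the rank-sum (Mann-Whitney) formulation: the AUC is the probability that a randomly chosen leaver gets a higher predicted score than a randomly chosen stayer. auc_rank is a hypothetical helper for illustration, not the function used above.

```r
# AUC via the rank-sum formula.
# actual: 0/1 vector (1 = positive class); score: predicted probability.
auc_rank <- function(actual, score) {
  r <- rank(score)                 # average ranks, so ties are handled
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny worked example: 3 of the 4 positive/negative pairs are ordered correctly
toy_auc <- auc_rank(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))
print(toy_auc)  # 0.75
```

A value of 0.5 means the scores rank no better than chance, and 1 means every leaver is scored above every stayer.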
rpart.plot(tree, type = 2, fallen.leaves = FALSE, extra = 2)
Satisfaction level seems to be the most crucial variable. If an employee’s satisfaction is above 0.46, they’re much more likely to stay (which is what we observed above).
If employees have low satisfaction, the number of projects becomes crucial to them. If the employees have more projects, they’re more likely to remain. If they have fewer projects, they leave.
If employees are happy, have been at the company for less than 4.5 years, and score over 81% on their last evaluation, they are very likely to leave. In that branch, the “decider” appears to be monthly hours over 216.
Let’s make another tree with fewer variables. We will use just three: satisfaction_level, last_evaluation, and average_montly_hours.
tree <- rpart(left ~ satisfaction_level + last_evaluation + average_montly_hours, data = train)
res <- predict(tree, test)
auc(as.numeric(test$left) - 1, res[, 2])
## [1] 0.9589022
rpart.plot(tree, type = 2, fallen.leaves = FALSE, cex = 0.5, extra = 2)
Unfortunately, the results are much the same: if employees are good and overworked, they leave; if they are unhappy, they will likely leave too, especially if they are not getting enough work.
Successful and overworked employees leave.
Unhappy and overworked employees leave.
Unhappy and underworked employees leave.
Employees with more than 6.5 years at the company are more likely to be happy, and fine with working long hours.
This is just an example of how HR analytics can be used in a company. The results can then be brainstormed and, hopefully, the company can implement effective methods to retain its top talent.
Have a nice day!