HR Analytics

We have a real-world dataset drawn from an HR database. It contains employee records, including whether each employee left the company.

Dataset taken from https://www.kaggle.com/ludobenistant/hr-analytics/data

Our challenge is to use a machine learning method, decision trees, to find possible reasons why employees are leaving.

## Parsed with column specification:
## cols(
##   satisfaction_level = col_double(),
##   last_evaluation = col_double(),
##   number_project = col_integer(),
##   average_montly_hours = col_integer(),
##   time_spend_company = col_integer(),
##   Work_accident = col_integer(),
##   left = col_integer(),
##   promotion_last_5years = col_integer(),
##   sales = col_character(),
##   salary = col_character()
## )

## Observations: 14,999
## Variables: 10
## $ satisfaction_level    <dbl> 0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10...
## $ last_evaluation       <dbl> 0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77...
## $ number_project        <int> 2, 5, 7, 5, 2, 2, 6, 5, 5, 2, 2, 6, 4, 2...
## $ average_montly_hours  <int> 157, 262, 272, 223, 159, 153, 247, 259, ...
## $ time_spend_company    <int> 3, 6, 4, 5, 3, 3, 4, 5, 5, 3, 3, 4, 5, 3...
## $ Work_accident         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ left                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ promotion_last_5years <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sales                 <chr> "sales", "sales", "sales", "sales", "sal...
## $ salary                <chr> "low", "medium", "medium", "low", "low",...
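The two listings above look like what older versions of readr's read_csv() and dplyr's glimpse() print. A minimal sketch of the loading step follows; the file name is an assumption about how the Kaggle CSV was saved locally, and the code is skipped if that file is not present.

```r
library(readr)
library(dplyr)

# Assumption: the Kaggle CSV was downloaded and saved under this name.
csv_path <- "HR_comma_sep.csv"

if (file.exists(csv_path)) {
  hr <- read_csv(csv_path)  # reports the column types it guessed
  glimpse(hr)               # row/column counts plus a preview of each column
}
```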

Our dataset contains the following variables:

satisfaction_level: A numeric score, probably self-reported by the employee.
last_evaluation: A numeric score, probably given by the employee’s manager.
number_project: An integer, the number of projects the employee has been involved in.
average_montly_hours: The average number of hours the employee works (or bills) per month. The misspelling "montly" is the column’s actual name in the data.
time_spend_company: An integer, the employee’s length of service, presumably in years.
Work_accident: A 0/1 flag, presumably whether the employee has had a work accident.
left: A 0/1 flag for whether the employee left the company.
promotion_last_5years: A 0/1 flag, presumably whether the employee was promoted in the last five years.
sales: Despite its name, this appears to be the department (team) the employee is assigned to.
salary: A 3-level pay grade indicator (low, medium, high).

Exploratory Data Analysis

ggplot(salary, aes(x = reorder(salary, Total), y = Total)) +
  geom_bar(fill = "#c91f01", stat = "identity") + xlab("Salary Levels") +
  geom_text(aes(label = Total), vjust = 1.6, color = "white",
            position = position_dodge(0.9), size = 3.5) + mytheme()

Only a small number of people are in the high pay grade. Interestingly, the head counts in the medium and low pay grades differ only slightly.
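The plotting code above relies on a summary table called salary with a Total column and on a mytheme() helper, neither of which is shown. One plausible way to build them is sketched below. Here hr_toy is made-up data, salary_counts is a hypothetical name for the summary table, and the theme_minimal() body of mytheme() is an assumption.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the hr data frame (illustrative values only).
hr_toy <- data.frame(
  salary = c("low", "low", "low", "medium", "medium", "high"),
  sales  = c("sales", "support", "sales", "technical", "sales", "support")
)

# One row per pay grade, with its head count in a column named Total.
salary_counts <- hr_toy %>% count(salary, name = "Total")

# Hypothetical stand-in for the post's mytheme() helper.
mytheme <- function() theme_minimal()

# Same layout as the original plot, built on the toy summary table.
p <- ggplot(salary_counts, aes(x = reorder(salary, Total), y = Total)) +
  geom_col(fill = "#c91f01") +
  geom_text(aes(label = Total), vjust = 1.6, color = "white", size = 3.5) +
  xlab("Salary Levels") +
  mytheme()
```

The same count() pattern with two columns, for example count(sales, salary, name = "Total"), would give a table shaped like the per-department summaries used in the later plots.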

ggplot(sales, aes(x=reorder(sales, Total), Total)) + 
  geom_bar(fill = "blue", stat = "identity") + xlab("Company Departments") +
  geom_text(aes(label=Total), vjust=1.6, color="white",
            position = position_dodge(0.9), size=3.5) + mytheme()

Next, the relationship between salary and department (team):

ggplot(by_salary_sale, aes(x = reorder(sales, Total), y = Total)) +
  geom_bar(fill = "red", stat = "identity") + xlab("Salary Distribution Per Department") +
  geom_text(aes(label = Total), vjust = 1.6, color = "white",
            position = position_dodge(0.9), size = 3.5) + mytheme() + facet_wrap(~ salary)

The high pay grade bars plausibly represent each department’s top management. The low pay grade bars would then be the main workforce of each department, since they hold the largest head counts, and the medium pay grade bars the mid-level supervisory roles.

Next, let’s check how employees have been promoted for the last 5 years.

ggplot(hr, aes(factor(promotion_last_5years))) + 
  geom_bar(fill = "brown") +
  mytheme() +
  xlab("Promoted during last 5 years")

It seems that very few employees were promoted within the last 5 years.

Let’s explore the relationship between promotion and department.

ggplot(by_promote_sales, aes(x = reorder(promotion_last_5years, Total), y = Total)) +
  geom_bar(fill = "maroon", stat = "identity") + xlab("Employees Promoted Per Department for last 5 years") +
  geom_text(aes(label = Total), vjust = 1.6, color = "white",
            position = position_dodge(0.9), size = 3.5) + mytheme() + facet_wrap(~ sales)

Speaking of promotions, we can see that very few employees have been promoted in any department. No employees at all were promoted in the product management department!

Let’s explore how many years each employee will possibly stay before leaving.

ggplot(year_spend, aes(x = time_spend_company, y = employee_count, color = sales)) +
  geom_line() + geom_point() +
  xlab("Year") + ylab("No of Employees") +
  theme_classic()

We can see a downward trend after 3 years at the company. Three years is the most common tenure, and the head count drops with each additional year, which suggests that many employees leave after about 3 years. Only a few stay much longer, perhaps those in top management.
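The year_spend table plotted above is not defined in the post. Assuming it holds one row per tenure and department with a head count, it could be built along these lines. hr_toy is made-up data used only to keep the sketch self-contained.

```r
library(dplyr)

# Toy stand-in for the hr data frame (illustrative values only).
hr_toy <- data.frame(
  time_spend_company = c(2, 3, 3, 3, 4, 5),
  sales = c("sales", "sales", "sales", "support", "support", "sales")
)

# Head count for every (tenure, department) combination that occurs.
year_spend <- hr_toy %>%
  count(time_spend_company, sales, name = "employee_count")
```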

Let’s explore the monthly hours worked in each department, broken down by pay grade.

ggplot(hr, aes(salary, average_montly_hours, fill = salary)) + 
  geom_bar(stat = "identity") + xlab("Pay Grade, Average work-hrs/month and Teams") +
  mytheme() + facet_grid(~sales) +scale_fill_manual(values = c("#593420","#080c26","#7d8228"))

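The sales team (low and medium pay grades) shows the tallest bars, followed by the support team and then the technical team (again low and medium pay grades), while the high pay grade shows the shortest bars in every team. Note that with stat = "identity" each bar stacks the monthly hours of every employee in that group, so bar height reflects total hours, and therefore head count, as well as how long each person works.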
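Because geom_bar(stat = "identity") stacks one segment per row, a large team can produce a tall bar even when each member works fewer hours. The sketch below, on made-up numbers, contrasts the summed hours with the per-employee mean; hr_toy and hours_summary are hypothetical names.

```r
library(dplyr)
library(ggplot2)

# Toy data: three sales employees with moderate hours, one support
# employee with long hours (illustrative values only).
hr_toy <- data.frame(
  sales  = c("sales", "sales", "sales", "support"),
  salary = c("low", "low", "low", "low"),
  average_montly_hours = c(150, 160, 170, 240)
)

hours_summary <- hr_toy %>%
  group_by(sales, salary) %>%
  summarise(
    total_hours = sum(average_montly_hours),   # what the stacked bar shows
    mean_hours  = mean(average_montly_hours),  # hours per employee
    .groups = "drop"
  )

# Plotting the mean removes the head-count effect from the bar heights.
p <- ggplot(hours_summary, aes(salary, mean_hours, fill = salary)) +
  geom_col() +
  facet_grid(~ sales)
```

In this toy case sales has the larger total (480 against 240) while support has the larger mean (240 against 160), so the two views can rank teams differently.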

Let’s explore Satisfaction levels

ggplot(sat, aes(satisfaction_level)) +
  geom_density(fill = "red") +
  facet_wrap( ~ left, ncol = 2)

People with low satisfaction seldom stay with the company. Based on the density plot, those who leave fall into three groups:

A. Employees who don’t “like” the company, with satisfaction < 0.2

B. Employees who are “not satisfied” with the company, with satisfaction < 0.5

C. Good employees who leave to seek “greener pastures”, with satisfaction > 0.6

Let’s check last evaluation grades of the employees.

ggplot(rate, aes(last_evaluation)) +
  geom_density(fill = "red") +
  facet_wrap( ~ left, ncol = 2)

Interesting! The two groups cover a similar range of scores. Perhaps the employees who leave are either:

Likely weaker at their job (note that the scale bottoms out at about 0.4), or
Very good at their job, and probably leaving for better opportunities.

Now, let’s try to understand why employees are leaving. We will fit a decision tree model to look for answers.

First, we split the data into training and test sets.

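library(dplyr)

n <- nrow(hr)
idx <- sample(n, n * .66)  # row indices for a roughly 66% training sample

# Recode the target as a labelled factor and salary as an ordered factor
tree_model <- hr %>%
  mutate(
    left = factor(left, labels = c("Remain", "Left")),
    salary = ordered(salary, c("low", "medium", "high"))
  )

train <- tree_model[idx, ]
test <- tree_model[-idx, ]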
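Since sample() draws a different split on every run, the AUC values reported further down will vary slightly between runs. A small sketch of how a fixed seed makes the split repeatable; the seed value and the row count of 100 are arbitrary choices for illustration, not part of the original analysis.

```r
n <- 100  # arbitrary row count for illustration

set.seed(2017)                      # arbitrary seed
idx1 <- sample(n, floor(n * .66))

set.seed(2017)                      # same seed again
idx2 <- sample(n, floor(n * .66))

identical(idx1, idx2)  # the two draws pick the same rows
```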

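Then we’ll train a single decision tree using rpart and evaluate it on the test set to see how good the fit is.

library(rpart)

tree <- rpart(left ~ ., data = train)

res <- predict(tree, test)  # class probabilities; column 2 is "Left"

# auc(actual, predicted): assumed to come from a package such as Metrics
auc(as.numeric(test$left) - 1, res[, 2])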

## [1] 0.9706983
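For reference, AUC can be read as the probability that a randomly chosen leaver receives a higher predicted score than a randomly chosen stayer. The base R sketch below computes it from ranks on four made-up predictions. auc_rank is a hypothetical helper written for illustration, not the function used in the post.

```r
# Rank-based AUC: share of (positive, negative) pairs in which the
# positive case gets the higher score; ties count as half.
auc_rank <- function(actual, predicted) {
  r <- rank(predicted)            # average ranks handle tied scores
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up labels (1 = left) and predicted probabilities.
actual    <- c(0, 0, 1, 1)
predicted <- c(0.1, 0.4, 0.35, 0.8)

auc_rank(actual, predicted)  # 0.75: three of the four pairs are ordered correctly
```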

That’s a high AUC score for a single tree. Let’s investigate further and check.

library(rpart.plot)
rpart.plot(tree, type = 2, fallen.leaves = FALSE, extra = 2)

Based on our tree model, we can observe that:

Satisfaction level seems to be the most important variable. If an employee’s satisfaction is above 0.46, they’re much more likely to stay (which is what we observed above).
If employees have low satisfaction, the number of projects becomes the deciding factor. With more projects they’re more likely to remain; with fewer projects they leave.
If employees are happy, have been at the company for less than 4.5 years, and scored over 0.81 (81%) on their last evaluation, they are very likely to leave. The “decider” appears to be average monthly hours over 216.
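Split thresholds like the ones quoted above can also be read from the fitted model object rather than from the plot. The following is a minimal sketch using the kyphosis data that ships with rpart as a stand-in, since the hr data is not reproduced here. print() lists every split with its cutoff, and variable.importance holds an importance score per predictor.

```r
library(rpart)

# Stand-in data: kyphosis is bundled with rpart (this is not the HR data).
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

print(fit)               # one line per node: split rule, n, loss, predicted class
fit$variable.importance  # named numeric vector of importance scores
```

Applied to the tree fitted in this post, the same two calls would show whether satisfaction_level really dominates the other predictors.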

In short:

If employees are successful and overworked, they leave.
If employees are unhappy and overworked, they leave.
If employees are unhappy and underworked, they leave.
If employees have been at the company for more than 6.5 years, they are more likely to be happy working longer hours.

Let’s build another tree with fewer variables. We will use just three: satisfaction_level, last_evaluation, and average_montly_hours.

tree <- rpart(left ~ satisfaction_level + last_evaluation + average_montly_hours, data = train)

res <- predict(tree, test)

auc(as.numeric(test$left) - 1, res[, 2])

## [1] 0.9589022

Good! That is still an excellent AUC score.

rpart.plot(tree, type = 2, fallen.leaves = FALSE, cex = 0.5, extra = 2)

Unfortunately, the results are much the same: if employees are good and overworked, they leave; if they are unhappy, they will likely leave, especially if they are not getting enough work.

To sum up everything we have done here:

1. If employees are successful and overworked, they leave.
2. If employees are unhappy and overworked, they leave.
3. If employees are unhappy and underworked, they leave.
4. If employees have been at the company for more than 6.5 years, they are more likely to be happy and willing to work long hours.

This is just one example of how HR analytics can be used in a company. The results can then be discussed and, hopefully, turned into effective measures that help the company retain its top talent.
Thanks for your time. I can be contacted at brenborbon@hotmail.com
Have a nice day!

HR Analytics with Decision Trees

Brennon Borbon

September 27, 2017