HR Analytics with Logistic Regression

Author

Satoshi Matsumoto

Published

December 2, 2023

Modified

January 1, 2024

Introduction

This is an HR analysis report based on an assignment from the Business Analytics course offered by ESSEC Business School. The analysis is organized according to the phases ASK, PREPARE, PROCESS, ANALYZE, SHARE, and ACT.

1 Ask

1.1 Data

The human resources data is anonymized data from a large consulting company in which a number of employees leave the firm. The data is provided by ESSEC Business School for the purpose of logistic regression analysis. Throughout this analysis, we will examine the correlations and hierarchical clusters in the data set and use visualizations to understand what differentiates the two outcome categories.

1.2 Questions

  • Which employees should be prioritized for retention within the company? As background, it is not realistic to follow up with every member of the workforce because of time limitations.
  • What drives employee attrition?

2 Prepare

2.1 Set up necessary tools

Load the following packages for the analysis.

Code
# Data manipulation packages
library(dplyr)
library(tidyverse)
library(scales)
library(skimr)
library(broom) #tidy the data table
# Visualization packages
library(ggplot2)
library(plotly)
library(kableExtra)
library(grid)
library(gridExtra)
library(GGally)
# Statistical packages
library(statsr)
library(corrplot)
library(PerformanceAnalytics)
library(caret)

2.2 Data import

Import the data from the CSV file “DATA_3.02_HR2.csv” and look at the first six rows.

Code
df <- read_csv("DATA_3.02_HR2.csv")
head(df)
# A tibble: 6 × 7
      S   LPE    NP   ANH   TIC Newborn  left
  <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>
1  0.38  0.53     2   157     3       0     1
2  0.8   0.86     5   262     6       0     1
3  0.11  0.88     7   272     4       0     1
4  0.72  0.87     5   223     5       0     1
5  0.37  0.52     2   159     3       0     1
6  0.41  0.5      2   153     3       0     1

Here, we load the HR dataset as a data frame named “df”. Some of the variable names are not self-explanatory, and all of the values are numeric, so we next take a look at the statistical summary of the dataset.

2.3 Statistical summary

Data summary
Name df
Number of rows 12000
Number of columns 8
_______________________
Column type frequency:
character 1
numeric 7
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ID 0 1 1 5 0 12000 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
S 0 1 0.63 0.24 0.09 0.48 0.66 0.82 1 ▃▃▇▇▇
LPE 0 1 0.72 0.17 0.36 0.57 0.72 0.86 1 ▂▇▆▇▇
NP 0 1 3.80 1.16 2.00 3.00 4.00 5.00 7 ▇▆▃▁▁
ANH 0 1 200.44 48.74 96.00 157.00 199.50 243.00 310 ▃▇▆▇▂
TIC 0 1 3.23 1.06 2.00 2.00 3.00 4.00 6 ▅▇▃▂▁
Newborn 0 1 0.15 0.36 0.00 0.00 0.00 0.00 1 ▇▁▁▁▂
left 0 1 0.17 0.37 0.00 0.00 0.00 0.00 1 ▇▁▁▁▂

Accordingly, the dataset contains 12,000 rows and eight columns: seven numeric columns plus a character ID column. There is no missing data. “Newborn” and “left” are categorical variables because they take only binary values.

Please refer to the following table for the variable definitions.

Variable Definition
S Satisfaction rate of the employee about the company
LPE Last project evaluation rate
NP The number of completed projects within the last 12 months
ANH Average number of hours worked per month
TIC The time spent in the company
Newborn Whether he or she had a baby within the last year
left Indicates whether an employee left (1) or stayed (0)
ID Unique ID for employees

2.4 Exploratory data analysis

Before we start the analysis, we take a look at the features of the variables in the data.

2.4.1 Densities

  • For the satisfaction rate, most of the employees scored more than 0.5 points out of 1.
  • For the last performance evaluation rate, most of the employees scored over 0.5.
  • For the monthly average working hours, most of the employees worked between 135 and 270 hours per month.
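The density plots for these variables are rendered as figures in the original report. As a minimal sketch, assuming the df loaded above, the satisfaction-rate density could be drawn as follows (the same pattern applies to LPE and ANH):

Code
# Density of the satisfaction rate; swap S for LPE or ANH to reproduce the other panels.
ggplot(df, aes(x = S)) +
  geom_density(fill = "#51829B", alpha = 0.4) +
  labs(title = "Density of Satisfaction Rate", x = "Satisfaction (S)", y = "Density") +
  theme_classic()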

2.4.2 Numbers of people for numbers of projects and years in company

  • Most of the employees have completed 3 or 4 projects.
  • Employees with three years in the company form the largest group.

2.4.3 Rate of “Newborn” and “Left”

  • 1,850 employees (15.4%) had a new baby while working at the company.
  • 2,000 employees (16.6%) left the company.

2.4.4 Correlation among variables

In this step, we apply the Pearson method to find the variables most strongly correlated with the “left” variable (a correlation sketch follows the list below).

According to the questions, we need to find the trends among employees who left. As the correlation table shows, the following variables drive attrition.

  • “TIC”, the time spent in the company, has a positive correlation with leaving.
  • “S”, satisfaction, has a negative correlation with leaving.
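The correlation table itself appears as a figure in the original report. A minimal sketch of how the Pearson correlations could be computed and visualized with the packages loaded above (dropping any non-numeric column such as ID):

Code
# Pearson correlation matrix of the numeric variables and a corrplot overview.
cor_matrix <- df %>% 
  select(where(is.numeric)) %>%   # keeps only numeric columns (drops ID if present)
  cor(method = "pearson")
round(cor_matrix["left", ], 2)              # correlation of every variable with "left"
corrplot(cor_matrix, method = "circle")     # visual overview of all pairwise correlations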

2.4.5 Time Spent vs Attrition rate

Let’s visualize the relation between attrition and time in the company and see the effect of one of the most important drivers, TIC (a plotting sketch follows the list below).

  • Accordingly, by the fourth year, almost half of the employees have left the company.
  • The third-year group is the largest, with over 5,000 employees.
  • After the fourth year, the number of leavers decreases.
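The underlying chart is a figure in the original report; a minimal sketch of one way to reproduce an attrition-rate-by-tenure view, assuming df with left still coded as numeric 0/1:

Code
# Headcount and attrition rate per year spent in the company.
tic_attrition <- df %>% 
  group_by(TIC) %>% 
  summarise(n_employees = n(), attrition_rate = mean(left))

ggplot(tic_attrition, aes(x = factor(TIC), y = attrition_rate)) +
  geom_col(fill = "#51829B") +
  scale_y_continuous(labels = percent) +
  labs(title = "Attrition Rate by Time Spent in the Company",
       x = "Years in the company (TIC)", y = "Attrition rate") +
  theme_classic()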

2.4.6 Satisfaction vs Attrition rate

According to the correlation analysis, satisfaction is negatively correlated with attrition. The satisfaction rate is recorded with two decimal places, so to make it easier to read, we group employees into satisfaction categories (a binning sketch follows the list below).

  • The chart clearly indicates that lower satisfaction leads to a higher attrition rate.
  • Action on job satisfaction is needed to improve the attrition rate.
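The satisfaction categories and the chart are figures in the original report; the break points below are illustrative assumptions, not necessarily the ones used by the author:

Code
# Bin the two-decimal satisfaction rate into broader categories (breaks are illustrative).
sat_attrition <- df %>% 
  mutate(S_group = cut(S, breaks = c(0, 0.25, 0.5, 0.75, 1), include.lowest = TRUE)) %>% 
  group_by(S_group) %>% 
  summarise(attrition_rate = mean(left), count = n())

ggplot(sat_attrition, aes(x = S_group, y = attrition_rate)) +
  geom_col(fill = "#51829B") +
  scale_y_continuous(labels = percent) +
  labs(title = "Attrition Rate by Satisfaction Category",
       x = "Satisfaction category", y = "Attrition rate") +
  theme_classic()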

2.4.7 New born vs Attrition rate

  • “1” indicates employees who had a baby in the last year.
  • The leaving rate among employees with a newborn was just 5.6%, so having a new baby is not strongly correlated with leaving the company.
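The 5.6% figure comes from a chart in the original report; under one plausible reading (the leaving rate within the Newborn == 1 group), it could be checked as follows:

Code
# Leaving rate among employees who had a baby in the last year (assumed reading of the 5.6% figure).
df %>% 
  filter(Newborn == 1) %>% 
  summarise(n = n(), leaving_rate = mean(left))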

2.4.8 Number of project vs Attrition rate

  • A low number of completed projects drives employees to leave.
  • Having more than six projects also drives employees to leave.

2.4.9 Monthly Average Working Hours

  • Over 280 hours of average monthly working time is not a good sign.

3 PROCESS/ANALYZE

So far, we have explored the features of the variables, their correlations, and visualizations of attrition against its driving factors. From the dataset, we found that 16.6% of employees left the company. “Time spent in the company” and “Satisfaction rate” are the main driving factors behind attrition. Then, who is most likely to leave the company? Identifying the employees likely to leave is also a main purpose of this analysis. To answer this question, we need to build a prediction model in addition to visualizing the correlations between the factors.

3.1 Model training

We use the sample dataset “DATA_4.02_HR3.csv” for testing the model.

Code
# Load the sample data (ca. 8% of the size of the training set) for testing.
df_test <- read_csv("DATA_4.02_HR3.csv")
# convert Newborn to a factor variable, matching the training data
df_test <- df_test %>% 
  mutate(Newborn = as.factor(Newborn))

head(df_test)
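The code that fits the logistic regression is not included in this export, although model_glm and summary_glm are used below. A minimal sketch of how they could be produced, assuming the full df from section 2.2 serves as the training set and Newborn is converted to a factor (as the Newborn1 coefficient in the output suggests):

Code
# Assumed training-set preparation and logistic regression fit (not shown in the original export).
df_train <- df %>% 
  mutate(Newborn = as.factor(Newborn))

model_glm   <- glm(left ~ S + LPE + NP + ANH + TIC + Newborn,
                   data = df_train, family = binomial(link = "logit"))
summary_glm <- summary(model_glm)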
Code
list( summary_glm$coefficient, # Coefficient table of the logistic regression; check p-values for significance.
      round( 1 - ( summary_glm$deviance / summary_glm$null.deviance ), 4 ) ) # Pseudo R-squared: the share of deviance explained by the model.
[[1]]
                Estimate   Std. Error    z value      Pr(>|z|)
(Intercept) -1.107603517 0.1795189675  -6.169841  6.835862e-10
S           -3.783653903 0.1358840924 -27.844716 1.248138e-170
LPE          0.453757223 0.2037831099   2.226667  2.596951e-02
NP          -0.370137672 0.0297192759 -12.454465  1.322358e-35
ANH          0.003455542 0.0006998283   4.937700  7.904926e-07
TIC          0.611926710 0.0306594942  19.958800  1.256946e-88
Newborn1    -1.557233812 0.1312241103 -11.866979  1.756974e-32

[[2]]
[1] 0.2106
  • The pseudo R-squared value is 0.2106, meaning the model explains only about 21 percent of the variation in the outcome.
  • This limitation comes from the dataset itself.
  • To improve the model, more explanatory variables would need to be collected.
  • Since no additional variables are available, we move on to the next part.

3.2 Logistic regression on “left” variable

In the dataset, leaving the company is a binary outcome: “YES” (1) or “NO” (0). In this case, a logistic regression model is suitable for predicting the probability of leaving the company.
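Concretely, the model maps a linear combination of the predictors to a probability through the logistic (inverse-logit) function, so predictions of type “response” always fall between 0 and 1. A minimal sketch of that relationship, assuming model_glm and df_train from the previous section:

Code
# The "response" prediction is the logistic transform of the linear predictor.
linear_pred <- predict(model_glm, newdata = df_train, type = "link")  # beta0 + beta1*S + ... on the log-odds scale
prob_pred   <- plogis(linear_pred)                                    # 1 / (1 + exp(-linear_pred))
all.equal(unname(prob_pred),
          unname(predict(model_glm, newdata = df_train, type = "response")))  # TRUE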

3.3 Prediction Model

Code
# prediction
df_train$prediction <- predict( model_glm, newdata = df_train, type = "response" )
df_test$prediction  <- predict( model_glm, newdata = df_test , type = "response" )

# distribution of the prediction score grouped by known outcome
ggplot( df_train, aes( prediction, color = as.factor(left) ) ) + 
geom_density( size = 1 ) +
ggtitle( "Training Set, Predicted Score" ) + 
labs( color = "data" )  +
scale_color_discrete(labels = c("Negative", "Positive"))

  • The predicted scores for the “Negative” (stayed) group are clearly concentrated near zero.
  • The scores for the “Positive” (left) group also pile up at low values, which partly reflects that leavers make up only 16.6% of the total.

As the density chart shows, the predicted scores do not yet separate the two groups very sharply. Since the prediction of a logistic regression model is a probability, in order to use it as a classifier we have to choose a cutoff value; we tentatively set it to 0.3.
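Choosing a cutoff is a judgment call; as a sanity check that is not part of the original report, the caret package loaded earlier could summarise how the tentative 0.3 cutoff behaves on the training set (assuming df_train carries the prediction column computed above):

Code
# Confusion matrix for the tentative 0.3 cutoff on the training set (illustrative check).
cutoff     <- 0.3
pred_class <- factor(ifelse(df_train$prediction >= cutoff, 1, 0), levels = c(0, 1))
confusionMatrix(pred_class, factor(df_train$left, levels = c(0, 1)), positive = "1")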

Code
# tidy from the broom package
coefficient <- tidy(model_glm)[ , c( "term", "estimate", "statistic" ) ]

# exponentiate the coefficients so they can be read as odds ratios
coefficient$estimate <- exp( coefficient$estimate )
coefficient
# A tibble: 7 × 3
  term        estimate statistic
  <chr>          <dbl>     <dbl>
1 (Intercept)   0.330      -6.17
2 S             0.0227    -27.8 
3 LPE           1.57        2.23
4 NP            0.691     -12.5 
5 ANH           1.00        4.94
6 TIC           1.84       20.0 
7 Newborn1      0.211     -11.9 
  • When “S” increases, the odds of leaving the company drop sharply: the odds ratio per unit increase is about 0.02 (z value -27.8).
  • When “TIC” increases by one year, the odds of leaving the company are multiplied by about 1.84 (z value 20.0).

3.4 Predicting

Now that we have our logistic regression model “model_glm”, we load the new dataset, whose outcomes are unknown, and use the model to predict the probability of leaving.

Code
# use the model to predict on new data with unknown outcomes, "DATA_4.02_HR3.csv"
df_hr <- read_csv("DATA_4.02_HR3.csv"  )
Rows: 1000 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (6): S, LPE, NP, ANH, TIC, Newborn

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
df_hr$Newborn = as.factor(df_hr$Newborn)
# predict
df_hr$prediction <- predict( model_glm, newdata = df_hr, type = "response" )
list( head(df_hr), nrow(df_hr) )
[[1]]
# A tibble: 6 × 7
      S   LPE    NP   ANH   TIC Newborn prediction
  <dbl> <dbl> <dbl> <dbl> <dbl> <fct>        <dbl>
1  0.86  0.69     4   105     4 1           0.0137
2  0.52  0.98     4   209     2 0           0.103 
3  0.84  0.6      5   207     2 0           0.0194
4  0.6   0.65     3   143     2 1           0.0174
5  0.85  0.57     3   227     2 0           0.0404
6  0.82  0.61     4   246     3 0           0.0613

[[2]]
[1] 1000

3.4.1 Cutoff

We can use the cutoff value from the previous section to determine which employees need attention.

Code
# cutoff (keep employees with at least a 30% predicted probability of attrition)
cutoff = 0.3
df_hr <- df_hr[ df_hr$prediction >= cutoff, ]
list( head(df_hr), nrow(df_hr) )
[[1]]
# A tibble: 6 × 7
      S   LPE    NP   ANH   TIC Newborn prediction
  <dbl> <dbl> <dbl> <dbl> <dbl> <fct>        <dbl>
1  0.36  0.65     5   119     5 0            0.365
2  0.22  0.41     2   248     3 0            0.549
3  0.2   0.75     4   248     3 0            0.423
4  0.95  0.73     3   286     6 0            0.305
5  0.6   0.94     3   272     4 0            0.338
6  0.44  0.6      3   188     6 0            0.671

[[2]]
[1] 119
  • Interviews need to be arranged with only 119 employees instead of all 1,000.

4 SHARE/Visualization

4.1 Time and prediction rate of attrition
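The chart for this subsection is rendered as a figure in the original report; a minimal sketch of one way it could be drawn, mirroring the style of section 4.3 and assuming df_hr with its prediction column from section 3.4:

Code
# Median predicted attrition probability by years in the company, for the employees above the cutoff.
median_tic <- df_hr %>% 
  group_by(TIC) %>% 
  summarise(prediction = median(prediction), count = n())

ggplot(median_tic, aes(x = factor(TIC), y = prediction)) +
  geom_point(aes(size = count), color = "#51829B") +
  labs(title = "Time Spent in the Company and Employee Attrition",
       x = "Years in the company (TIC)", y = "Attrition Probability") +
  theme_classic() +
  theme(legend.position = "none")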

4.2 Satisfaction and prediction rate of attrition

  • Lower satisfaction increases the attrition rate.

4.3 Last Project Evaluation and Employee Attrition

Code
df_hr$LPECUT <- cut( df_hr$LPE, breaks = quantile(df_hr$LPE), include.lowest = TRUE )
median_lpe <- df_hr %>% group_by(LPECUT) %>% 
                       summarise( prediction = median(prediction), count = n() )

ggplot( median_lpe, aes( LPECUT, prediction ) ) + 
geom_point( aes( size = count ), color = "#51829B" ) + 
labs( title = "Last Project's Evaluation and Employee Attrition", 
      y = "Attrition Probability", x = "Last Project's Evaluation by Client" ) +
theme_classic() +
theme( legend.position = "none" ) # applied after theme_classic() so the size legend stays hidden

  • The points appear to be randomly scattered.
  • So we will prioritize employees by their predicted probability instead.

4.4 Meeting list

The employer needs to take immediate action for the following employees.

Code
df_meeting <- df_hr %>% 
  mutate(ID = row_number(),
         Newborn = recode(Newborn, `0` = "No", `1` = "Yes")) %>% 
  filter(prediction > 0.6,                  # keep probabilities above 60%
         LPECUT != "[0.38,0.545]") %>%      # workers in the lowest evaluation range are not prioritized
  arrange(desc(prediction)) %>% 
  mutate(prediction = percent(round(prediction, 4))) %>%  # format as a percentage only after filtering and sorting
  select(ID, prediction, S, LPECUT, NP, ANH, TIC, Newborn)

df_meeting %>%
  kable(col.names = c("ID", "Attrition Probability", "Satisfaction", "Evaluation Range", "Annual Project", "Average Monthly Working Hours", "Time Spent in Company", "Newborn")) %>%
    kable_styling(bootstrap_options = c("striped", "hover"))
ID Attrition Probability Satisfaction Evaluation Range Annual Project Average Monthly Working Hours Time Spent in Company Newborn
67 88.70% 0.27 (0.805,0.99] 2 246 6 No
55 88.27% 0.31 (0.805,0.99] 2 284 6 No
115 87.16% 0.22 (0.805,0.99] 3 251 6 No
48 85.34% 0.15 (0.545,0.68] 4 287 6 No
69 83.99% 0.14 (0.545,0.68] 3 140 6 No
25 83.75% 0.24 (0.805,0.99] 3 209 6 No
19 82.16% 0.15 (0.68,0.805] 3 272 5 No
56 82.07% 0.23 (0.545,0.68] 3 186 6 No
29 81.32% 0.15 (0.545,0.68] 3 257 5 No
32 81.23% 0.12 (0.805,0.99] 5 240 6 No
33 80.74% 0.39 (0.805,0.99] 2 205 6 No
102 80.26% 0.26 (0.545,0.68] 2 265 5 No
66 79.35% 0.31 (0.545,0.68] 3 227 6 No
34 77.42% 0.17 (0.805,0.99] 5 230 6 No
106 76.61% 0.13 (0.545,0.68] 4 260 5 No
76 75.53% 0.25 (0.805,0.99] 5 274 6 No
70 75.05% 0.25 (0.805,0.99] 3 232 5 No
111 74.51% 0.21 (0.805,0.99] 3 180 5 No
7 73.84% 0.53 (0.68,0.805] 2 261 6 No
86 73.55% 0.24 (0.805,0.99] 3 193 5 No
54 72.99% 0.15 (0.805,0.99] 3 266 4 No
83 71.93% 0.35 (0.545,0.68] 4 254 6 No
109 69.00% 0.59 (0.805,0.99] 2 225 6 No
103 68.90% 0.13 (0.545,0.68] 4 156 5 No
98 68.64% 0.20 (0.545,0.68] 5 167 6 No
6 67.05% 0.44 (0.545,0.68] 3 188 6 No
97 66.98% 0.14 (0.68,0.805] 5 238 5 No
51 66.78% 0.23 (0.68,0.805] 6 264 6 No
43 66.44% 0.17 (0.545,0.68] 5 277 5 No
110 66.10% 0.14 (0.805,0.99] 4 273 4 No
119 65.74% 0.20 (0.68,0.805] 5 108 6 No
101 65.09% 0.34 (0.805,0.99] 5 224 6 No
87 65.04% 0.26 (0.545,0.68] 3 142 5 No
35 63.95% 0.13 (0.68,0.805] 6 281 5 No
59 62.42% 0.21 (0.545,0.68] 5 266 5 No
30 62.28% 0.18 (0.545,0.68] 3 197 4 No
71 61.91% 0.14 (0.68,0.805] 6 281 5 No
88 61.70% 0.20 (0.68,0.805] 3 192 4 No
99 60.94% 0.24 (0.545,0.68] 3 237 4 No
  • The list contains 39 employees who combine a high attrition probability with high performance evaluations.

5 ACT

5.1 Recommendations

  1. Arrange an in-person meeting with each listed employee within a month in order to hear how they feel about their jobs and what they expect.
  2. Before the meeting, as a company, prepare a recommended career plan and expectations for each employee, in line with company policy.

We hope this analysis will help improve the attrition rate and support the long-term success of the business.