HR Analytics by logistic regression

Author

Satoshi Matsumoto

Published

December 2, 2023

Modified

January 1, 2024

Introduction

This is an HR analysis report based on an assignment from the Business Analytics course at ESSEC Business School. The analysis follows the phases ASK, PREPARE, PROCESS, ANALYZE, SHARE, and ACT.

1 Ask

1.1 Data

The Human Resources data is anonymized data from a large consulting company that records whether each employee left the firm. The data is provided by ESSEC Business School for the purpose of logistic regression analysis. Throughout this analysis, we examine the correlations and hierarchical clusters in the data set and use visualizations to understand what differentiates the two categories, leavers and stayers.

1.2 Questions

Which employees should the company prioritize for retention? As background, following up with every member of the workforce is not realistic because of time limitations.
What drives employee attrition?

2 Prepare

2.1 Set up necessary tools

Load the following packages for the analysis.

Code

# Data manipulation packages
library(dplyr)
library(tidyverse)
library(scales)
library(skimr)
library(broom) #tidy the data table
# Visualization packages
library(ggplot2)
library(plotly)
library(kableExtra)
library(grid)
library(gridExtra)
library(GGally)
# Statistic packages
library(statsr)
library(corrplot)
library(PerformanceAnalytics)
library(caret)

2.2 Data import

Import the data from the CSV file “DATA_3.02_HR2.csv”. The first 6 rows of the data are as follows.

Code

df <- read_csv("DATA_3.02_HR2.csv")
head(df)

# A tibble: 6 × 7
      S   LPE    NP   ANH   TIC Newborn  left
  <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>
1  0.38  0.53     2   157     3       0     1
2  0.8   0.86     5   262     6       0     1
3  0.11  0.88     7   272     4       0     1
4  0.72  0.87     5   223     5       0     1
5  0.37  0.52     2   159     3       0     1
6  0.41  0.5      2   153     3       0     1

Here, we load the HR dataset as a data frame named “df”. The meaning of some variable names is not obvious, and all columns are numeric. Next, we take a look at the statistical summary of the dataset.

2.3 Statistical summary

Data summary
Name	df
Number of rows	12000
Number of columns	8
_______________________
Column type frequency:
character	1
numeric	7
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
ID	0	1	1	5	0	12000	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
S	1	0.63	0.24	0.09	0.48	0.66	0.82	1	▃▃▇▇▇
LPE	1	0.72	0.17	0.36	0.57	0.72	0.86	1	▂▇▆▇▇
NP	1	3.80	1.16	2.00	3.00	4.00	5.00	7	▇▆▃▁▁
ANH	1	200.44	48.74	96.00	157.00	199.50	243.00	310	▃▇▆▇▂
TIC	1	3.23	1.06	2.00	2.00	3.00	4.00	6	▅▇▃▂▁
Newborn	1	0.15	0.36	0.00	0.00	0.00	0.00	1	▇▁▁▁▂
left	1	0.17	0.37	0.00	0.00	0.00	0.00	1	▇▁▁▁▂

The dataset contains 12000 rows and eight columns: seven numeric columns and a character ID column. There is no missing data. “Newborn” and “left” are categorical variables because they take only binary values.

Please refer to the table of definition for variable names as follows.

Variable	Definition
S	Satisfaction rate of the employee about the company
LPE	Last project evaluation rate
NP	The number of completed projects within the last 12 months
ANH	Average number of hours worked per month
TIC	Time spent in the company (years)
Newborn	Whether the employee had a baby within the last year
left	Whether the employee left (1) or stayed (0)
ID	Unique ID for employees

2.4 Exploratory data analysis

Before we start analysis work, we take a look at the features of variables in the data.

2.4.1 Densities

For the satisfaction rate, most employees scored more than 0.5 points out of 1.
For the last project evaluation rate, most employees scored over 0.5.
For the monthly average working hours, most employees worked between 135 and 270 hours per month.

2.4.2 Numbers of people for numbers of projects and years in company

Most employees completed 3 or 4 projects.
Employees with 3 years in the company form the largest group.

2.4.3 Rate of “Newborn” and “Left”

1850 employees (15.4%) had a newborn while working at the company.
2000 employees (16.6%) left the company.

2.4.4 Correlation among variables

In this step, we apply the Pearson method to find the variables most strongly correlated with the “left” variable.

To answer the question, we need to find what the leavers have in common. According to the correlation table, the following variables are most strongly associated with leaving.

“TIC”, time spent in the company, is positively correlated with leaving.
“S”, satisfaction, is negatively correlated with leaving.
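
The correlation step is described here without its code. A minimal sketch of how it could look in base R follows; the data frame `toy` is simulated purely for illustration (it is not the report's dataset), and the coefficients used to simulate `left` are assumptions.

```r
# Simulated stand-in for the HR data (hypothetical values, not the real file).
set.seed(42)
n <- 2000
toy <- data.frame(
  S   = runif(n, 0.1, 1),                 # satisfaction rate
  TIC = sample(2:6, n, replace = TRUE)    # years in the company
)
# Assumption: low satisfaction and long tenure raise the odds of leaving.
toy$left <- rbinom(n, 1, plogis(-1 - 4 * toy$S + 0.6 * toy$TIC))

# Pearson correlation of every column with `left`, strongest first.
cor_matrix    <- cor(toy, method = "pearson")
cor_with_left <- cor_matrix[, "left"]
cor_with_left <- cor_with_left[names(cor_with_left) != "left"]
cor_with_left[order(abs(cor_with_left), decreasing = TRUE)]
```

Sorting by absolute value keeps strong negative correlations, such as the one for satisfaction, at the top of the list alongside strong positive ones.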

2.4.5 Time Spent vs Attrition rate

Visualize the relation between attrition and time in the company. Let’s see the effect of one of the most important drivers: TIC.

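By the fourth year, almost half of the employees had left the company.
The third-year group is the largest, with over 5000 employees; because only about 2000 employees left in total, this figure counts everyone in that group rather than leavers alone.
After the fourth year, the number of leavers decreases.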

2.4.6 Satisfaction vs Attrition rate

According to the correlation analysis, satisfaction with the company is negatively correlated with attrition. The satisfaction rate is recorded to two decimal places, so to make it easier to read, we group it into satisfaction categories.

The chart clearly indicates that lower satisfaction leads to a higher attrition rate.
Action to raise job satisfaction is needed to reduce the attrition rate.
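
The binning step is not shown in the report. The sketch below illustrates one way to group a two-decimal satisfaction score into categories and compute the attrition rate per category; the break points, the labels, and the simulated `toy` data are all assumptions made for the example.

```r
# Simulated stand-in data (hypothetical, not the report's dataset).
set.seed(1)
n <- 2000
toy <- data.frame(S = round(runif(n, 0.09, 1), 2))
toy$left <- rbinom(n, 1, plogis(1 - 4 * toy$S))

# Group satisfaction into five equal-width categories (labels are illustrative).
toy$S_level <- cut(
  toy$S,
  breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
  labels = c("Very low", "Low", "Medium", "High", "Very high"),
  include.lowest = TRUE
)

# Attrition rate (mean of the 0/1 outcome) and head count per category.
rate_by_level <- aggregate(left ~ S_level, data = toy, FUN = mean)
rate_by_level$count <- as.vector(table(toy$S_level))
rate_by_level
```

Reporting the head count next to each rate helps to judge whether a high rate rests on many employees or only a few.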

2.4.7 Newborn vs Attrition rate

“1” indicates employees who had a baby in the last year.
The attrition rate among employees with a newborn was just 5.6%, so having a new baby is not strongly associated with leaving the company.

2.4.8 Number of projects vs Attrition rate

A low number of projects is associated with higher attrition.
More than 6 projects is also associated with higher attrition.

2.4.9 Monthly Average Working Hours

An average of over 280 working hours per month is a warning sign.

3 PROCESS/ANALYZE

So far, we have explored the variables, their correlations, and visualizations of attrition against its driving factors. From the data, we found that about 16% of employees left the company, and that “Time spent in the company” and “Satisfaction rate” are the main factors associated with attrition. Who, then, is most likely to leave? Identifying employees who are likely to leave is also a main purpose of this analysis. To answer that question, we need a prediction model in addition to the visualizations of correlations.

3.1 Model training

We use the sample dataset “DATA_4.02_HR3.csv” for testing the model.

Code

#Load sample data with ca.8% of data for testing.
#df_test <- read_csv("DATA_4.02_HR3.csv")
# convert the newborn to factor variables
#df_test=df_test %>% 
#  mutate(Newborn = as.factor(Newborn))

#head(df_test)
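
The step that creates `df_train`, `df_test`, `model_glm`, and `summary_glm` is not shown in this report. The following is a minimal sketch of how such objects could be built, not the author's actual code: the simulated `df_toy`, the 80/20 split ratio, and the model formula are assumptions. It uses base R only; `caret::createDataPartition()` would be an alternative for a stratified split.

```r
# Simulated stand-in with the report's column names (hypothetical values).
set.seed(123)
n <- 1000
df_toy <- data.frame(
  S       = runif(n, 0.09, 1),
  LPE     = runif(n, 0.36, 1),
  NP      = sample(2:7, n, replace = TRUE),
  ANH     = round(runif(n, 96, 310)),
  TIC     = sample(2:6, n, replace = TRUE),
  Newborn = factor(rbinom(n, 1, 0.15))
)
df_toy$left <- rbinom(n, 1, plogis(-1 - 4 * df_toy$S + 0.6 * df_toy$TIC))

# Assumption: a simple random 80/20 split into training and test sets.
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
df_train  <- df_toy[train_idx, ]
df_test   <- df_toy[-train_idx, ]

# Logistic regression of `left` on all predictors.
model_glm <- glm(left ~ S + LPE + NP + ANH + TIC + Newborn,
                 data = df_train, family = binomial(link = "logit"))
summary_glm <- summary(model_glm)
summary_glm$coefficients
```

Because `Newborn` is a factor with levels 0 and 1, its coefficient appears as `Newborn1`, which matches the naming in the coefficient table of this report.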

Code

list( summary_glm$coefficients, # Coefficient table of the logistic regression; check the p-values for significance.
      round( 1 - ( summary_glm$deviance / summary_glm$null.deviance ), 4 ) ) # Pseudo R-squared: share of the null deviance that the model accounts for.

[[1]]
                Estimate   Std. Error    z value      Pr(>|z|)
(Intercept) -1.107603517 0.1795189675  -6.169841  6.835862e-10
S           -3.783653903 0.1358840924 -27.844716 1.248138e-170
LPE          0.453757223 0.2037831099   2.226667  2.596951e-02
NP          -0.370137672 0.0297192759 -12.454465  1.322358e-35
ANH          0.003455542 0.0006998283   4.937700  7.904926e-07
TIC          0.611926710 0.0306594942  19.958800  1.256946e-88
Newborn1    -1.557233812 0.1312241103 -11.866979  1.756974e-32

[[2]]
[1] 0.2106

The pseudo R-squared value is 0.2106, meaning the model reduces the deviance by only about 21 percent relative to the intercept-only model.
The cause of the problem lies in the dataset itself.
To improve this, more variables would need to be collected for this dataset.
Since we cannot do that here, we move on to the next part.
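
To make the pseudo R-squared concrete, the sketch below computes it in two equivalent ways on a small simulated model. The data and the names `fit` and `fit_null` are hypothetical; for an ungrouped 0/1 outcome the deviance equals minus twice the log-likelihood, so the deviance ratio matches McFadden's log-likelihood ratio.

```r
# Simulated binary outcome with one predictor (hypothetical data).
set.seed(7)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 1.5 * x))
fit <- glm(y ~ x, family = binomial)

# Pseudo R-squared from the residual and null deviance.
pseudo_r2 <- 1 - fit$deviance / fit$null.deviance

# The same quantity from the log-likelihoods of the fitted and intercept-only models.
fit_null     <- glm(y ~ 1, family = binomial)
pseudo_r2_ll <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(fit_null))

round(c(deviance_based = pseudo_r2, loglik_based = pseudo_r2_ll), 4)
```

This measure is not on the same scale as the R-squared of a linear regression, so a value that would look low for a linear model is not unusual for a logistic one.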

3.2 Logistic regression on “left” variable

In the data set, leaving the company is a binary outcome: “YES 1” or “NO 0”. In this case, a logistic regression model is suitable for predicting the probability of leaving the company.

3.3 Prediction Model

Code

# prediction
df_train$prediction <- predict( model_glm, newdata = df_train, type = "response" )
df_test$prediction  <- predict( model_glm, newdata = df_test , type = "response" )

# distribution of the prediction score grouped by known outcome
ggplot( df_train, aes( prediction, color = as.factor(left) ) ) + 
geom_density( size = 1 ) +
ggtitle( "Training Set, Predicted Score" ) + 
labs( color = "data" )  +
scale_color_discrete(labels = c("Negative", "Positive"))

The “Negative” scores for the “left” variable are clearly concentrated at the low end, with a long tail to the right.
The “Positive” scores also lean toward the low end, which reflects that leavers make up only about 16% of the data.

As the density chart shows, the predicted scores of the two groups still overlap, so the model does not yet separate them cleanly. Because a logistic regression predicts a probability, we need a cutoff value to use it as a classifier; we tentatively choose 0.3.
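
One way to judge a tentative cutoff is a confusion matrix. The sketch below uses simulated scores and outcomes (hypothetical, not the report's predictions) to show how the 0.3 cutoff turns probabilities into classes and how accuracy, sensitivity, and specificity follow from the resulting table.

```r
# Hypothetical predicted probabilities and outcomes consistent with them.
set.seed(99)
n <- 1000
score  <- runif(n)
actual <- rbinom(n, 1, score)

# Classify as a leaver when the score reaches the cutoff.
cutoff    <- 0.3
predicted <- as.integer(score >= cutoff)

# Fixing the factor levels keeps the table 2 x 2 even if a class is empty.
conf_mat <- table(
  Actual    = factor(actual,    levels = c(0, 1)),
  Predicted = factor(predicted, levels = c(0, 1))
)

accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)
sensitivity <- conf_mat["1", "1"] / sum(conf_mat["1", ])  # leavers caught
specificity <- conf_mat["0", "0"] / sum(conf_mat["0", ])  # stayers cleared
conf_mat
```

A low cutoff such as 0.3 favors sensitivity over specificity, which suits a retention use case where missing a likely leaver costs more than one extra interview.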

Code

# tidy from the broom package
coefficient <- tidy(model_glm)[ , c( "term", "estimate", "statistic" ) ]

# exponentiate the coefficients to express them as odds ratios
coefficient$estimate <- exp( coefficient$estimate )
coefficient

# A tibble: 7 × 3
  term        estimate statistic
  <chr>          <dbl>     <dbl>
1 (Intercept)   0.330      -6.17
2 S             0.0227    -27.8 
3 LPE           1.57        2.23
4 NP            0.691     -12.5 
5 ANH           1.00        4.94
6 TIC           1.84       20.0 
7 Newborn1      0.211     -11.9

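When “S” increases by one unit, the odds of leaving the company are multiplied by 0.0227, a decrease of about 98%; its z statistic of -27.8 makes it the most significant predictor.
When “TIC” increases by one year, the odds of leaving the company are multiplied by 1.84, an increase of about 84%; its z statistic is 20.0.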
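
Because satisfaction only ranges from 0.09 to 1, a one-unit change spans almost the whole scale. The short calculation below rescales the effect to a 0.1-point step, using the coefficients copied from the regression summary in section 3.1; the helper `pct_change` is a hypothetical name introduced for this example.

```r
# Coefficients (log-odds) copied from the regression summary in this report.
beta_S   <- -3.783653903
beta_TIC <-  0.611926710

# Odds ratios for different step sizes.
or_S_unit  <- exp(beta_S)        # satisfaction rises by a full point
or_S_tenth <- exp(beta_S * 0.1)  # satisfaction rises by 0.1
or_TIC     <- exp(beta_TIC)      # one more year in the company

# Express an odds ratio as a percentage change in the odds.
pct_change <- function(or) round((or - 1) * 100, 1)
pct_change(c(S_per_0.1 = or_S_tenth, TIC_per_year = or_TIC))
```

These are changes in the odds, not in the probability; the change in probability depends on where the employee starts on the logistic curve.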

3.4 Predicting

With our logistic regression model “model_glm” in hand, we load the dataset whose outcome is unknown and use the model to predict the probability of leaving.

Code

# use the model to predict on data with an unknown outcome, "DATA_4.02_HR3.csv"
df_hr <- read_csv("DATA_4.02_HR3.csv")

Rows: 1000 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (6): S, LPE, NP, ANH, TIC, Newborn

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Code

df_hr$Newborn <- as.factor(df_hr$Newborn)
# predict
df_hr$prediction <- predict( model_glm, newdata = df_hr, type = "response" )
list( head(df_hr), nrow(df_hr) )

[[1]]
# A tibble: 6 × 7
      S   LPE    NP   ANH   TIC Newborn prediction
  <dbl> <dbl> <dbl> <dbl> <dbl> <fct>        <dbl>
1  0.86  0.69     4   105     4 1           0.0137
2  0.52  0.98     4   209     2 0           0.103 
3  0.84  0.6      5   207     2 0           0.0194
4  0.6   0.65     3   143     2 1           0.0174
5  0.85  0.57     3   227     2 0           0.0404
6  0.82  0.61     4   246     3 0           0.0613

[[2]]
[1] 1000
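
To see what `predict(..., type = "response")` does, the first row of the table above can be reproduced by hand from the coefficients reported in section 3.1. This is a worked check under the assumption that the model contains exactly those seven terms; the vector names `coefs` and `x` are introduced here for illustration.

```r
# Coefficients copied from the regression summary in this report.
coefs <- c(`(Intercept)` = -1.107603517, S = -3.783653903, LPE = 0.453757223,
           NP = -0.370137672, ANH = 0.003455542, TIC = 0.611926710,
           Newborn1 = -1.557233812)

# First row of the prediction table; Newborn = 1 switches the dummy on.
x <- c(`(Intercept)` = 1, S = 0.86, LPE = 0.69, NP = 4, ANH = 105,
       TIC = 4, Newborn1 = 1)

# Linear predictor (log-odds), then the inverse logit to get a probability.
eta <- sum(coefs * x[names(coefs)])
p   <- plogis(eta)   # same as 1 / (1 + exp(-eta))
round(p, 4)
```

The result agrees with the 0.0137 shown for that employee, which confirms that the prediction is the inverse logit of the weighted sum of the inputs.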

3.4.1 Cutoff

We can use the cutoff value from the previous section to determine which employees need attention.

Code

# cutoff (attrition probability of 30% or more)
cutoff <- 0.3
df_hr <- df_hr[ df_hr$prediction >= cutoff, ]
list( head(df_hr), nrow(df_hr) )

[[1]]
# A tibble: 6 × 7
      S   LPE    NP   ANH   TIC Newborn prediction
  <dbl> <dbl> <dbl> <dbl> <dbl> <fct>        <dbl>
1  0.36  0.65     5   119     5 0            0.365
2  0.22  0.41     2   248     3 0            0.549
3  0.2   0.75     4   248     3 0            0.423
4  0.95  0.73     3   286     6 0            0.305
5  0.6   0.94     3   272     4 0            0.338
6  0.44  0.6      3   188     6 0            0.671

[[2]]
[1] 119

Interviews need to be arranged with only 119 employees instead of all 1000.

4 SHARE/Visualization

4.1 Time and prediction rate of attrition

4.2 Satisfaction and prediction rate of attrition

Lower satisfaction increases the attrition rate.

4.3 Last Project Evaluation and Employee Attrition

Code

# Split the evaluation score into quartile ranges
df_hr$LPECUT <- cut( df_hr$LPE, breaks = quantile(df_hr$LPE), include.lowest = TRUE )
median_lpe <- df_hr %>% group_by(LPECUT) %>% 
                       summarise( prediction = median(prediction), count = n() )

ggplot( median_lpe, aes( LPECUT, prediction ) ) + 
geom_point( aes( size = count ), color = "#51829B" ) + 
labs( title = "Last Project's Evaluation and Employee Attrition", 
      y = "Attrition Probability", x = "Last Project's Evaluation by Client" ) +
theme_classic() +
theme( legend.position = "none" ) # set after theme_classic(), which would otherwise reset it

The points appear to be randomly scattered across the evaluation ranges.
So, we prioritize employees by their predicted probability instead.

4.4 Meeting list

The employer needs to take immediate action for the following employees.

Code

df_meeting <- df_hr %>% 
  mutate(ID = row_number()) %>% 
  # Filter and sort on the numeric probability before formatting it as text.
  filter(prediction > 0.6,
         LPECUT != "[0.38,0.545]") %>% # Exclude the lowest evaluation range to focus on stronger performers.
  arrange(desc(prediction)) %>% 
  mutate(prediction = percent(round(prediction, 4)),
         Newborn = recode(Newborn, `0` = "No", `1` = "Yes")) %>% 
  select(ID, prediction, S, LPECUT, NP, ANH, TIC, Newborn)

df_meeting %>%
  kable(col.names = c("ID", "Attrition Probability", "Satisfaction", "Evaluation Range", "Annual Project", "Average Monthly Working Hours", "Time Spent in Company", "Newborn")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

ID	Attrition Probability	Satisfaction	Evaluation Range	Annual Project	Average Monthly Working Hours	Time Spent in Company	Newborn
67	88.70%	0.27	(0.805,0.99]	2	246	6	No
55	88.27%	0.31	(0.805,0.99]	2	284	6	No
115	87.16%	0.22	(0.805,0.99]	3	251	6	No
48	85.34%	0.15	(0.545,0.68]	4	287	6	No
69	83.99%	0.14	(0.545,0.68]	3	140	6	No
25	83.75%	0.24	(0.805,0.99]	3	209	6	No
19	82.16%	0.15	(0.68,0.805]	3	272	5	No
56	82.07%	0.23	(0.545,0.68]	3	186	6	No
29	81.32%	0.15	(0.545,0.68]	3	257	5	No
32	81.23%	0.12	(0.805,0.99]	5	240	6	No
33	80.74%	0.39	(0.805,0.99]	2	205	6	No
102	80.26%	0.26	(0.545,0.68]	2	265	5	No
66	79.35%	0.31	(0.545,0.68]	3	227	6	No
34	77.42%	0.17	(0.805,0.99]	5	230	6	No
106	76.61%	0.13	(0.545,0.68]	4	260	5	No
76	75.53%	0.25	(0.805,0.99]	5	274	6	No
70	75.05%	0.25	(0.805,0.99]	3	232	5	No
111	74.51%	0.21	(0.805,0.99]	3	180	5	No
7	73.84%	0.53	(0.68,0.805]	2	261	6	No
86	73.55%	0.24	(0.805,0.99]	3	193	5	No
54	72.99%	0.15	(0.805,0.99]	3	266	4	No
83	71.93%	0.35	(0.545,0.68]	4	254	6	No
109	69.00%	0.59	(0.805,0.99]	2	225	6	No
103	68.90%	0.13	(0.545,0.68]	4	156	5	No
98	68.64%	0.20	(0.545,0.68]	5	167	6	No
6	67.05%	0.44	(0.545,0.68]	3	188	6	No
97	66.98%	0.14	(0.68,0.805]	5	238	5	No
51	66.78%	0.23	(0.68,0.805]	6	264	6	No
43	66.44%	0.17	(0.545,0.68]	5	277	5	No
110	66.10%	0.14	(0.805,0.99]	4	273	4	No
119	65.74%	0.20	(0.68,0.805]	5	108	6	No
101	65.09%	0.34	(0.805,0.99]	5	224	6	No
87	65.04%	0.26	(0.545,0.68]	3	142	5	No
35	63.95%	0.13	(0.68,0.805]	6	281	5	No
59	62.42%	0.21	(0.545,0.68]	5	266	5	No
30	62.28%	0.18	(0.545,0.68]	3	197	4	No
71	61.91%	0.14	(0.68,0.805]	6	281	5	No
88	61.70%	0.20	(0.68,0.805]	3	192	4	No
99	60.94%	0.24	(0.545,0.68]	3	237	4	No

The list contains 39 employees who combine a high attrition probability (above 60%) with an evaluation above the lowest range.

5 ACT

5.1 Recommendations

Arrange an in-person meeting with each listed employee within a month to hear how they feel about their job and what they expect.
Before the meeting, we recommend that the company prepare a career plan and a statement of expectations for each employee, in line with company policy.

We hope this analysis helps to reduce the attrition rate and supports the long-term benefit of the business.