This is an HR analysis report based on an assignment from the Business Analytics course by ESSEC Business School. The analysis follows the phases ASK, PREPARE, PROCESS, ANALYZE, SHARE, and ACT.
1 Ask
1.1 Data
The Human Resources data is anonymized data from a large consulting company and records which employees left the firm. The data is provided by ESSEC Business School for the purpose of logistic analysis. Throughout this analysis, we examine the correlations and hierarchical clusters in the dataset and use visualizations to understand what distinguishes the two categories of employees, those who left and those who stayed.
1.2 Questions
Which employees should the company prioritize retaining? As background, following up with every member of the workforce is not realistic because of time limitations.
What drives employee attrition?
2 Prepare
2.1 Set up necessary tools
Load the following packages for the analysis.
Code
# Data manipulation packages
library(dplyr)
library(tidyverse)
library(scales)
library(skimr)
library(broom) # tidy the data table
# Visualization packages
library(ggplot2)
library(plotly)
library(kableExtra)
library(grid)
library(gridExtra)
library(GGally)
# Statistic packages
library(statsr)
library(corrplot)
library(PerformanceAnalytics)
library(caret)
2.2 Data import
Import the data from the CSV file “DATA_3.02_HR2.csv” and look at the first 6 rows.
Here, we load the HR dataset as a data frame named “df”. Some variable names are not easy to interpret, and nearly all columns are numeric. Next, we take a look at the statistical summary of the dataset.
2.3 Statistical summary
Data summary
Name                      df
Number of rows            12000
Number of columns         8
_______________________
Column type frequency:
  character               1
  numeric                 7
________________________
Group variables           None

Variable type: character
skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
ID                     0              1    1    5      0     12000           0

Variable type: numeric
skim_variable  n_missing  complete_rate    mean     sd     p0     p25     p50     p75  p100  hist
S                      0              1    0.63   0.24   0.09    0.48    0.66    0.82     1  ▃▃▇▇▇
LPE                    0              1    0.72   0.17   0.36    0.57    0.72    0.86     1  ▂▇▆▇▇
NP                     0              1    3.80   1.16   2.00    3.00    4.00    5.00     7  ▇▆▃▁▁
ANH                    0              1  200.44  48.74  96.00  157.00  199.50  243.00   310  ▃▇▆▇▂
TIC                    0              1    3.23   1.06   2.00    2.00    3.00    4.00     6  ▅▇▃▂▁
Newborn                0              1    0.15   0.36   0.00    0.00    0.00    0.00     1  ▇▁▁▁▂
left                   0              1    0.17   0.37   0.00    0.00    0.00    0.00     1  ▇▁▁▁▂
The dataset contains 12000 rows and eight columns: seven numeric columns and the character column ID. There is no missing data. “Newborn” and “left” are categorical variables because they take binary values.
The variable names are defined in the following table.
Variable  Definition
S         Satisfaction rate of the employee about the company
LPE       Last project evaluation rate
NP        The number of completed projects within the last 12 months
ANH       Average number of hours worked per month
TIC       The time spent in the company
Newborn   Whether he or she had a baby within the last year
left      Whether an employee left (1) or stayed (0)
ID        Unique ID for employees
2.4 Exploratory data analysis
Before starting the analysis, we look at the distribution of each variable in the data.
2.4.1 Densities
For the satisfaction rate, most employees scored more than 0.5 out of 1.
For the last project evaluation rate, most employees scored over 0.5.
For the monthly average working hours, most employees worked between 135 and 270 hours per month.
2.4.2 Number of employees by number of projects and years in the company
Most employees completed 3 or 4 projects.
Employees with 3 years at the company form the largest group.
2.4.3 Rate of “Newborn” and “Left”
1850 employees (15.4%) had a baby within the last year.
2000 employees (16.6%) left the company.
2.4.4 Correlation among variables
In this step, we use the Pearson correlation to find the variables most strongly correlated with the “left” variable.
To answer the questions, we need to identify the patterns among employees who left. According to the correlation table, the following variables are most strongly associated with leaving.
“TIC”, time spent in the company, is positively correlated with leaving.
“S”, satisfaction, is negatively correlated with leaving.
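The correlation step described above can be sketched in base R as follows. This is a minimal illustration on a made-up toy data frame (the values are assumptions, not rows from the HR dataset); only the column names S, TIC, and left are taken from the report.

```r
# Toy data, invented for illustration only (not the HR dataset)
toy <- data.frame(
  S    = c(0.90, 0.80, 0.20, 0.10, 0.70, 0.30),
  TIC  = c(2, 3, 5, 4, 2, 5),
  left = c(0, 0, 1, 1, 0, 1)
)

# Pearson correlation matrix of all numeric columns
cor_matrix <- cor(toy, method = "pearson")

# Correlation of each predictor with `left`, sorted by absolute strength
cor_with_left <- cor_matrix[c("S", "TIC"), "left"]
cor_with_left <- cor_with_left[order(abs(cor_with_left), decreasing = TRUE)]
print(round(cor_with_left, 2))
```

In this toy example the sign of each coefficient shows the direction of the association with leaving, which is how the correlation table in the report is read.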
2.4.5 Time Spent vs Attrition rate
This section visualizes the relationship between attrition and time in the company, to see the effect of one of the most important drivers: TIC.
By the fourth year, almost half of the employees had left the company.
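The third-year group is the largest, with over 5000 employees; since only 2000 employees left in total, this figure is the group's headcount rather than the number of leavers.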
After the fourth year, attrition decreases.
2.4.6 Satisfaction vs Attrition rate
According to the correlation analysis, satisfaction with the company is negatively correlated with leaving. Satisfaction rates are recorded to two decimal places, so to make them easier to read we group them into satisfaction categories.
The chart clearly indicates that lower satisfaction is associated with a higher attrition rate.
Action to improve job satisfaction is needed to reduce the attrition rate.
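One possible way to build such satisfaction categories is sketched below in base R. The bin edges, the category labels, and the toy values are all assumptions made for illustration; the report does not state which breaks were actually used.

```r
# Toy data, invented for illustration only (not the HR dataset)
toy <- data.frame(
  S    = c(0.10, 0.15, 0.35, 0.45, 0.55, 0.65, 0.85, 0.95),
  left = c(1,    1,    1,    0,    0,    1,    0,    0)
)

# Bin satisfaction into four equal-width categories (hypothetical breaks and labels)
toy$S_level <- cut(
  toy$S,
  breaks = c(0, 0.25, 0.5, 0.75, 1),
  labels = c("very low", "low", "high", "very high"),
  include.lowest = TRUE
)

# Attrition rate per category = mean of the 0/1 `left` indicator
attrition_by_level <- tapply(toy$left, toy$S_level, mean)
print(attrition_by_level)
```

Because `left` is coded 0/1, its mean within each category is the share of employees in that category who left.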
2.4.7 Newborn vs Attrition rate
“1” indicates employees who had a baby within the last year.
The attrition rate among employees who had a baby was just 5.6%, so having a newborn is not a strong driver of leaving the company.
2.4.8 Number of projects vs Attrition rate
A low number of projects is associated with employees leaving.
More than 6 projects is also associated with employees leaving.
2.4.9 Monthly Average Working Hours
An average of over 280 working hours per month is not a good sign.
3 Process/Analyze
So far, we have explored the features of the variables, their correlations, and visualizations of attrition against its driving factors. From the data, we found that about 16% of employees left the company. “Time spent in the company” and “Satisfaction rate” are the main driving factors for attrition. So who is most likely to leave the company? Identifying employees who are likely to leave is also a main purpose of this analysis. To answer this question, we need to build a prediction model in addition to visualizing the correlations among factors.
3.1 Model training
We use the sample dataset “DATA_4.02_HR3.csv” to test the model.
Code
# Load sample data with ca. 8% of data for testing.
# df_test <- read_csv("DATA_4.02_HR3.csv")
# Convert Newborn to a factor variable
# df_test <- df_test %>%
#   mutate(Newborn = as.factor(Newborn))
# head(df_test)
Code
list(
  summary_glm$coefficients, # coefficient summary table of the logistic regression, used to check p-values for significance
  round(1 - (summary_glm$deviance / summary_glm$null.deviance), 4) # pseudo R-squared: share of the null deviance explained by the model
)
The pseudo R-squared value is 0.2106, meaning the model reduces the null deviance by only about 21 percent.
The cause of the problem lies in the dataset itself.
To improve this, more variables should be collected for this dataset.
Since nothing more can be done with the available data, we move on to the next part.
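The pseudo R-squared calculation used above can be reproduced end to end with a small sketch. The toy data and the single-predictor formula `left ~ S` are assumptions for illustration; the report does not show how `model_glm` was actually specified.

```r
# Toy data, invented for illustration only (not the HR dataset)
toy <- data.frame(
  S    = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.35, 0.65, 0.15),
  left = c(1,   1,   0,   1,   0,   1,   0,   0,   0,   1,    0,    1)
)

# Fit a logistic regression (hypothetical specification with one predictor)
fit <- glm(left ~ S, data = toy, family = binomial)
fit_summary <- summary(fit)

# Pseudo R-squared: 1 - residual deviance / null deviance
pseudo_r2 <- 1 - fit_summary$deviance / fit_summary$null.deviance
print(round(pseudo_r2, 4))
```

A value near 0 means the predictors barely improve on an intercept-only model, while a value near 1 means the model fits the observed outcomes almost perfectly.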
3.2 Logistic regression on “left” variable
In the dataset, leaving the company is a binary outcome: “YES” (1) or “NO” (0). In this case, a logistic regression model is suitable for predicting the probability of leaving the company.
3.3 Prediction Model
Code
# prediction
df_train$prediction <- predict(model_glm, newdata = df_train, type = "response")
df_test$prediction <- predict(model_glm, newdata = df_test, type = "response")
# distribution of the prediction score grouped by known outcome
ggplot(df_train, aes(prediction, color = as.factor(left))) +
  geom_density(size = 1) +
  ggtitle("Training Set, Predicted Score") +
  labs(color = "data") +
  scale_color_discrete(labels = c("Negative", "Positive"))
The predicted scores for the “Negative” group of the “left” variable are clearly concentrated toward the left, at low probabilities.
The “Positive” scores also lean toward the left, which reflects the fact that leavers make up only about 16% of the data.
As the density chart shows, the predicted scores do not yet separate the two groups well. Because the output of a logistic regression model is a probability, we have to choose a cutoff value to use it as a classifier. We tentatively set the cutoff to 0.3.
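Applying a cutoff to predicted probabilities can be sketched as follows. The scores and outcomes are invented for illustration, and only the 0.3 cutoff comes from the text; the report itself loads caret, but this sketch uses base R so that it runs on its own.

```r
# Hypothetical predicted probabilities and known outcomes (not model output)
scores <- data.frame(
  prediction = c(0.05, 0.10, 0.25, 0.35, 0.40, 0.60, 0.80, 0.20),
  left       = c(0,    0,    0,    0,    1,    1,    1,    1)
)

# Classify as "will leave" when the predicted probability reaches the cutoff
cutoff <- 0.3
scores$predicted_left <- as.integer(scores$prediction >= cutoff)

# Confusion matrix: rows are actual outcomes, columns are predicted classes
confusion <- table(actual = scores$left, predicted = scores$predicted_left)
print(confusion)

# Share of correct predictions and share of actual leavers that were caught
accuracy    <- mean(scores$predicted_left == scores$left)
sensitivity <- confusion["1", "1"] / sum(confusion["1", ])
```

Lowering the cutoff flags more employees as likely leavers, which catches more actual leavers at the cost of more false alarms, so the choice depends on how costly a missed leaver is compared with an unnecessary follow-up.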
Code
# tidy from the broom package
coefficient <- tidy(model_glm)[, c("term", "estimate", "statistic")]
# transform the coefficients from log-odds to odds ratios
coefficient$estimate <- exp(coefficient$estimate)
coefficient
# A tibble: 7 × 3
term estimate statistic
<chr> <dbl> <dbl>
1 (Intercept) 0.330 -6.17
2 S 0.0227 -27.8
3 LPE 1.57 2.23
4 NP 0.691 -12.5
5 ANH 1.00 4.94
6 TIC 1.84 20.0
7 Newborn1 0.211 -11.9
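“S” has the strongest negative effect (z-statistic of -27.8): a one-unit increase in satisfaction multiplies the odds of leaving by about 0.023.
“TIC” has the strongest positive effect (z-statistic of 20.0): each additional unit of time in the company multiplies the odds of leaving by about 1.84.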
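To make the exponentiated estimates in the coefficient table more concrete, the sketch below converts an odds ratio into a change in probability. The two odds ratios are taken from the table above; the 20% baseline probability is a hypothetical starting point chosen only for illustration.

```r
# Odds ratios from the coefficient table above
odds_ratio_tic <- 1.84
odds_ratio_s   <- 0.0227

# Hypothetical baseline probability of leaving (an assumption, not from the data)
p_baseline    <- 0.20
odds_baseline <- p_baseline / (1 - p_baseline)

# Effect of one additional unit of TIC: multiply the odds, then convert back
odds_after_tic <- odds_baseline * odds_ratio_tic
p_after_tic    <- odds_after_tic / (1 + odds_after_tic)

# Effect of a 0.1-point increase in S: the odds ratio is raised to the power 0.1
odds_after_s <- odds_baseline * odds_ratio_s^0.1
p_after_s    <- odds_after_s / (1 + odds_after_s)

print(round(c(baseline = p_baseline, more_tic = p_after_tic, more_s = p_after_s), 3))
```

The odds ratio applies to the odds, not to the probability directly, so the resulting change in probability depends on the baseline chosen.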
3.4 Predicting
With our logistic regression model stored as “model_glm”, we load a dataset whose outcome is unknown and use the model to predict the probability of leaving.
Code
# use the model to predict an unknown outcome for the data in "DATA_4.02_HR3.csv"
df_hr <- read_csv("DATA_4.02_HR3.csv")
Rows: 1000 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (6): S, LPE, NP, ANH, TIC, Newborn
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.