Final Project

Matthew Baker, Don Padmaperuma, Subhalaxmi Rout, Erinda Budo

2020-12-15


Abstract

HR analytics uncovers people-related trends in the data and helps the HR department take appropriate steps to keep the organization running smoothly and profitably. Attrition in a corporate setting is one of the complex challenges that people managers and HR personnel have to deal with.

In this research assignment, we investigated data on employee attrition at a company. The data set is fictional and was created by IBM data scientists.

We collected this dataset from Kaggle, using the following link: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

Methodology

We obtained the data set from Kaggle.com using this link: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset. The fictional data set was originally created by IBM data scientists to uncover the factors that lead to employee attrition and to explore questions such as which factors most influence attrition among employees. The original dataset can also be accessed at https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/. For convenience of analysis, it was saved in our group GitHub repository as a .csv file. The attrition dataset has 1470 observations and 35 variables; among them is the target variable Attrition, with possible outcomes “Yes” and “No”. We analyze attrition by gender, education, income, and working environment, and finally build a predictive model to determine whether an employee is going to quit or not.
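Loading the saved file is a one-liner; the path below is only illustrative (it uses the Kaggle file name), not our repository's actual location:

hr_raw <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv",   # illustrative path, not our repo URL
                   stringsAsFactors = TRUE)
dim(hr_raw)   # 1470 rows, 35 columns before dropping the non-informative attributes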

Experimentation and Results

Data Preparation

Checking for missing values and removing attributes that add no value

##                      Age                Attrition           BusinessTravel 
##                        0                        0                        0 
##                DailyRate               Department         DistanceFromHome 
##                        0                        0                        0 
##                Education           EducationField            EmployeeCount 
##                        0                        0                        0 
##           EmployeeNumber  EnvironmentSatisfaction                   Gender 
##                        0                        0                        0 
##               HourlyRate           JobInvolvement                 JobLevel 
##                        0                        0                        0 
##                  JobRole          JobSatisfaction            MaritalStatus 
##                        0                        0                        0 
##            MonthlyIncome              MonthlyRate       NumCompaniesWorked 
##                        0                        0                        0 
##                   Over18                 OverTime        PercentSalaryHike 
##                        0                        0                        0 
##        PerformanceRating RelationshipSatisfaction            StandardHours 
##                        0                        0                        0 
##         StockOptionLevel        TotalWorkingYears    TrainingTimesLastYear 
##                        0                        0                        0 
##          WorkLifeBalance           YearsAtCompany       YearsInCurrentRole 
##                        0                        0                        0 
##  YearsSinceLastPromotion     YearsWithCurrManager 
##                        0                        0
## Data Set has  1470  Rows and  31  Columns

Fortunately, there is no missing or duplicate data.

Also, some of the attributes that are categorical are represented as integers in the dataset, so we need to convert them to categorical (factor) variables.
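As a minimal sketch (not necessarily our exact code), the integer-coded survey-style columns from the IBM dataset can be converted to factors as follows; the data frame name hr_data is an assumption:

# Assumed data frame name: hr_data. Convert the integer-coded ordinal columns to factors.
cat_cols <- c("Education", "EnvironmentSatisfaction", "JobInvolvement", "JobLevel",
              "JobSatisfaction", "PerformanceRating", "RelationshipSatisfaction",
              "StockOptionLevel", "WorkLifeBalance")
hr_data[cat_cols] <- lapply(hr_data[cat_cols], factor)
str(hr_data[cat_cols])   # verify the columns are now factors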

Visualization

In this section, we visualize the influence of each variable on attrition in the organization.

Age plot and Fig 1

  1. Age: We see that the majority of employees leaving the organization are around 30 years old (28-36). The average age is between 30 and 40.
  2. Business Travel: Among people who leave, most travel frequently or rarely.
  3. Department: Fewer attrited employees come from the HR department, which reflects the low proportion of HR staff in the organization (Fig 1).
  4. Distance From Home: Contrary to common assumptions, a majority of employees who have left the organization live close to the office.
  5. Daily Rate: We are not able to see any distinguishable pattern here (Fig 1).
Fig 2

  1. Education: From the data we know that 1-‘Below College’, 2-‘College’, 3-‘Bachelor’, 4-‘Master’, 5-‘Doctor’. Looking at the plot, very few Doctors attrite, possibly because there are few of them. Based on the data, most employees have a Bachelor's-level education.
  2. Education Field: In line with the trend in Departments, a minority of HR-educated employees leave, mainly because of the low proportion of HR staff in the organization.
  3. Employee Count: This is an insignificant variable for us.
  4. Employee Number: This is also an insignificant variable for us.
  5. Environment Satisfaction: Ratings stand for 1-‘Low’, 2-‘Medium’, 3-‘High’, 4-‘Very High’. We don't see any distinguishable pattern (Fig 2).
Fig 3

  1. Gender: The majority of separated employees are male, likely because around 61% of employees in the dataset are male.
  2. Hourly Rate: There seems to be no straightforward relation with the Daily Rate of the employees.
  3. Job Involvement: Ratings stand for 1-‘Low’, 2-‘Medium’, 3-‘High’, 4-‘Very High’. The majority of employees who leave are either very highly involved or minimally involved in their jobs.
  4. Job Level: As Job Level increases, the number of people quitting decreases.
  5. Job Satisfaction: Per the data, 1-‘Low’, 2-‘Medium’, 3-‘High’, 4-‘Very High’. We see higher attrition among the lower job-satisfaction levels.
Fig 4

  1. Marital Status: Attrition is highest for single employees and lowest for divorced employees. Most employees are married.
  2. Monthly Income: We see higher attrition in the lower segment of monthly income; viewed in isolation, this might be due to dissatisfaction with pay. A larger number of employees earn less.
  3. Monthly Rate: We don't see any inferable trend here, and there is no straightforward relation with Monthly Income.
  4. Number of Companies Worked: There is a clear indication that people who have worked at only one company before quit a lot.
Fig 5

  1. Over Time: A larger proportion of overtime employees are quitting.
  2. Percent Salary Hike: People with a hike of less than 15% are more likely to leave.
  3. Performance Rating: 1-‘Low’, 2-‘Good’, 3-‘Excellent’, 4-‘Outstanding’. We only have employees with ratings of 3 and 4; a smaller proportion of 4-raters quit.
  4. Relationship Satisfaction: 1-‘Low’, 2-‘Medium’, 3-‘High’, 4-‘Very High’. A higher number of people with a rating of 3 or more are quitting. There is a considerable amount of low and medium relationship satisfaction in this organization.

Fig 6

  1. Stock Option Level: Larger proportions of employees at levels 1 and 2 tend to quit.
  2. Total Working Years: We see larger proportions of people with 1 year of experience quitting, and more generally those in the 1-10 year bracket. The more experience you have, the more likely you are to stay.
  3. Training Times Last Year: This indicates the number of training interventions the employee attended. People who have been trained 2-4 times are an area of concern.
  4. Work Life Balance: Ratings per the metadata are 1-‘Bad’, 2-‘Good’, 3-‘Better’, 4-‘Best’. As expected, a larger proportion of employees with a rating of 1 quit, but in absolute numbers rating 3 is on the higher side.

Fig 7

  1. Years at Company: A larger proportion of newcomers are quitting the organization, which undermines the organization's recruitment efforts.
  2. Years In Current Role: The plot shows a larger proportion with just 0 years in the role quitting; a role change may be a trigger for quitting.
  3. Years Since Last Promotion: A larger proportion of people who have been promoted recently have quit the organization.
  4. Years With Current Manager: As expected, a new manager is a big cause of quitting.

Correlation

The plot below shows correlated variables: for example, OverTime is positively correlated with Attrition, while MonthlyIncome is negatively correlated with it.

To get an all-numeric data set for the correlation plot, we apply the following recodings to the categorical attributes (a sketch of the recoding follows the list).

  • Business Travel (1 = Non-Travel, 2 = Travel Frequently, 3 = Travel Rarely)
  • Department (1 = Human Resources, 2 = Research & Development, 3 = Sales)
  • Education Field (1 = Human Resources, 2 = Life Sciences, 3 = Marketing, 4 = Medical Sciences, 5 = Other, 6 = Technical Degree)
  • Gender (2 = Female, 1 = Male)
  • Job Role (1 = Healthcare Representative, 2 = Human Resources, 3 = Laboratory Technician, 4 = Manager, 5 = Manufacturing Director, 6 = Research Director, 7 = Research Scientist, 8 = Sales Executive, 9 = Sales Representative)
  • Marital Status (1 = Divorced, 2 = Single, 3 = Married)
  • Overtime (1 = No, 2 = Yes)
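As an illustration (not our exact code), the recoding and the correlation plot could be produced as follows. The data frame name hr_data, the corrplot package, and the raw level labels (taken from the IBM dataset) are assumptions:

# Sketch: recode a few of the character columns to the integer codes above,
# then plot correlations among the numeric columns. Starts from the raw frame hr_data.
hr_num <- hr_data
hr_num$Attrition      <- ifelse(hr_num$Attrition == "Yes", 1L, 0L)
hr_num$OverTime       <- ifelse(hr_num$OverTime == "Yes", 2L, 1L)
hr_num$Gender         <- ifelse(hr_num$Gender == "Female", 2L, 1L)
hr_num$BusinessTravel <- as.integer(factor(hr_num$BusinessTravel,
                           levels = c("Non-Travel", "Travel_Frequently", "Travel_Rarely")))
# ...repeat for Department, EducationField, JobRole and MaritalStatus...
library(corrplot)
corrplot(cor(hr_num[sapply(hr_num, is.numeric)]), method = "circle")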

Train and Test Split

Let's split the data into two parts, train and test: train contains 75% of the data and test contains 25%.

The training set has 1103 rows and 31 columns; the test set has 367 rows and 31 columns.
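A sketch of the 75/25 split is shown below. The caret package, the seed value, and the name hr_test are assumptions; hr_train is the name that appears in the model output later.

library(caret)
set.seed(123)                                   # arbitrary seed for reproducibility
idx      <- createDataPartition(hr_data$Attrition, p = 0.75, list = FALSE)
hr_train <- hr_data[idx, ]                      # ~75% of rows for training
hr_test  <- hr_data[-idx, ]                     # remaining ~25% for testing
dim(hr_train); dim(hr_test)                     # expect roughly 1103 x 31 and 367 x 31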

Model Building

Model 1 - Logistic Regression with all features

We fit a logistic regression model using all variables; a sketch of the fit and its evaluation is shown below, followed by the model summary.
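The fit mirrors the Call shown in the output below. The evaluation lines are a hedged sketch: a 0.5 probability cutoff is assumed, and Attrition is assumed to be a Yes/No factor.

# Logistic regression on all predictors (same call as in the output below).
model1 <- glm(Attrition ~ ., family = binomial(link = "logit"), data = hr_train)
summary(model1)

# Hedged evaluation sketch: predict on the test set and apply a 0.5 cutoff.
pred_prob  <- predict(model1, newdata = hr_test, type = "response")
pred_class <- ifelse(pred_prob > 0.5, "Yes", "No")
table(Predicted = pred_class, Actual = hr_test$Attrition)   # confusion matrix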

## 
## Call:
## glm(formula = Attrition ~ ., family = binomial(link = "logit"), 
##     data = hr_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9441  -0.5282  -0.3031  -0.1262   3.4353  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               6.341e+00  1.281e+00   4.950 7.41e-07 ***
## Age                      -3.520e-02  1.507e-02  -2.336 0.019493 *  
## BusinessTravel           -8.743e-02  1.485e-01  -0.589 0.555968    
## DailyRate                -4.530e-04  2.398e-04  -1.889 0.058871 .  
## Department                6.392e-01  2.827e-01   2.261 0.023753 *  
## DistanceFromHome          3.301e-02  1.182e-02   2.794 0.005207 ** 
## Education                 2.208e-02  9.474e-02   0.233 0.815682    
## EducationField            5.396e-02  7.237e-02   0.746 0.455919    
## EnvironmentSatisfaction  -3.548e-01  9.110e-02  -3.895 9.83e-05 ***
## Gender                   -4.557e-01  2.081e-01  -2.190 0.028540 *  
## HourlyRate               -5.238e-03  4.833e-03  -1.084 0.278444    
## JobInvolvement           -5.471e-01  1.385e-01  -3.951 7.80e-05 ***
## JobLevel                 -2.766e-01  3.248e-01  -0.852 0.394434    
## JobRole                  -5.732e-02  5.800e-02  -0.988 0.322994    
## JobSatisfaction          -3.153e-01  9.040e-02  -3.488 0.000487 ***
## MaritalStatus            -1.773e-01  1.274e-01  -1.391 0.164107    
## MonthlyIncome            -1.905e-05  7.851e-05  -0.243 0.808319    
## MonthlyRate               1.121e-05  1.394e-05   0.804 0.421443    
## NumCompaniesWorked        1.993e-01  4.296e-02   4.640 3.48e-06 ***
## OverTime                  1.794e+00  2.091e-01   8.582  < 2e-16 ***
## PercentSalaryHike        -8.461e-03  4.311e-02  -0.196 0.844410    
## PerformanceRating        -1.744e-01  4.396e-01  -0.397 0.691608    
## RelationshipSatisfaction -3.116e-01  9.069e-02  -3.436 0.000590 ***
## StockOptionLevel         -4.925e-01  1.249e-01  -3.943 8.03e-05 ***
## TotalWorkingYears        -8.091e-02  3.269e-02  -2.475 0.013324 *  
## TrainingTimesLastYear    -1.761e-01  7.839e-02  -2.247 0.024665 *  
## WorkLifeBalance          -2.937e-01  1.338e-01  -2.195 0.028148 *  
## YearsAtCompany            4.248e-02  4.476e-02   0.949 0.342579    
## YearsInCurrentRole       -1.561e-01  5.090e-02  -3.066 0.002169 ** 
## YearsSinceLastPromotion   1.945e-01  4.634e-02   4.198 2.69e-05 ***
## YearsWithCurrManager     -3.895e-02  5.112e-02  -0.762 0.446113    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 974.94  on 1102  degrees of freedom
## Residual deviance: 703.79  on 1072  degrees of freedom
## AIC: 765.79
## 
## Number of Fisher Scoring iterations: 6
Metric        Value
Accuracy      0.8664850
Precision     0.8865672
Sensitivity   0.9642857
Specificity   0.3559322
F1 score      0.9237947
AUC           0.6601090

Model 2 - Logistic Regression with significant features

Model 1 includes some insignificant variables, so for this model we remove them.

## 
## Call:
## glm(formula = Attrition ~ . - BusinessTravel - Department - Education - 
##     EducationField - Gender - HourlyRate - JobLevel - JobRole - 
##     MaritalStatus - MonthlyIncome - MonthlyRate - PercentSalaryHike - 
##     PerformanceRating - TotalWorkingYears - TrainingTimesLastYear - 
##     YearsAtCompany, family = binomial(link = "logit"), data = hr_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0635  -0.5549  -0.3356  -0.1744   3.2644  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               5.5567355  0.8172084   6.800 1.05e-11 ***
## Age                      -0.0728333  0.0125475  -5.805 6.45e-09 ***
## DailyRate                -0.0004540  0.0002331  -1.948 0.051429 .  
## DistanceFromHome          0.0312206  0.0113019   2.762 0.005738 ** 
## EnvironmentSatisfaction  -0.3058260  0.0871700  -3.508 0.000451 ***
## JobInvolvement           -0.5259334  0.1326082  -3.966 7.31e-05 ***
## JobSatisfaction          -0.2630740  0.0856637  -3.071 0.002133 ** 
## NumCompaniesWorked        0.1619426  0.0397724   4.072 4.67e-05 ***
## OverTime                  1.6906651  0.1975211   8.559  < 2e-16 ***
## RelationshipSatisfaction -0.2730902  0.0868695  -3.144 0.001668 ** 
## StockOptionLevel         -0.4376040  0.1204280  -3.634 0.000279 ***
## WorkLifeBalance          -0.3230708  0.1282947  -2.518 0.011796 *  
## YearsInCurrentRole       -0.1591994  0.0447435  -3.558 0.000374 ***
## YearsSinceLastPromotion   0.1667490  0.0400291   4.166 3.10e-05 ***
## YearsWithCurrManager     -0.0582141  0.0424817  -1.370 0.170583    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 974.94  on 1102  degrees of freedom
## Residual deviance: 744.14  on 1088  degrees of freedom
## AIC: 774.14
## 
## Number of Fisher Scoring iterations: 6
Metric        Value
Accuracy      0.8637602
Precision     0.8794118
Sensitivity   0.9707792
Specificity   0.3050847
F1 score      0.9228395
AUC           0.6379320

Model 3 - Random Forest

Random forest is a supervised learning algorithm. The “forest” it builds is an ensemble of decision trees, usually trained with the “bagging” method. The general idea of bagging is that a combination of learning models improves the overall result.
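A hedged sketch of such a fit, assuming the randomForest package, a factor-valued Attrition column, and the hr_train / hr_test split from above:

library(randomForest)
set.seed(123)                                       # arbitrary seed for reproducibility
model3 <- randomForest(Attrition ~ ., data = hr_train,
                       ntree = 500, importance = TRUE)   # 500 bagged trees (package default)
varImpPlot(model3)                                  # variable importance from the forest
pred3  <- predict(model3, newdata = hr_test)
table(Predicted = pred3, Actual = hr_test$Attrition)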

Metric        Value
Accuracy      0.8528610
Precision     0.8649425
Sensitivity   0.9772727
Specificity   0.2033898
F1 score      0.9176829
AUC           0.5903313

Model 4 - XGBoost

XGBoost is a classification or regression technique that builds decision trees sequentially, with each tree focusing on correcting the errors of the previous one. The final output is a combination of the results from all trees.
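A hedged sketch of the boosting step: the 20 rounds and the AUC evaluation metric are taken from the training log below, while the matrix construction, objective, and label encoding are assumptions.

library(xgboost)
train_x <- as.matrix(hr_train[, setdiff(names(hr_train), "Attrition")])  # assumes all predictors are numeric
train_y <- as.integer(hr_train$Attrition == "Yes")                       # assumes Yes/No labels
dtrain  <- xgb.DMatrix(data = train_x, label = train_y)
model4  <- xgb.train(params    = list(objective = "binary:logistic", eval_metric = "auc"),
                     data      = dtrain,
                     nrounds   = 20,                                     # matches the 20 rounds logged below
                     watchlist = list(train = dtrain))                   # prints train-auc each round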

## [21:46:19] WARNING: amalgamation/../src/learner.cc:516: 
## Parameters: { set_seed } might not be used.
## 
##   This may not be accurate due to some parameters are only used in language bindings but
##   passed down to XGBoost core.  Or some parameters are not used but slip through this
##   verification. Please open an issue if you find above cases.
## 
## 
## [1]  train-auc:0.828813 
## [2]  train-auc:0.896875 
## [3]  train-auc:0.917373 
## [4]  train-auc:0.957574 
## [5]  train-auc:0.963365 
## [6]  train-auc:0.973456 
## [7]  train-auc:0.986195 
## [8]  train-auc:0.989371 
## [9]  train-auc:0.992609 
## [10] train-auc:0.995153 
## [11] train-auc:0.997243 
## [12] train-auc:0.998433 
## [13] train-auc:0.999247 
## [14] train-auc:0.999338 
## [15] train-auc:0.999472 
## [16] train-auc:0.999836 
## [17] train-auc:0.999885 
## [18] train-auc:0.999958 
## [19] train-auc:0.999994 
## [20] train-auc:1.000000
Metric        Value
Accuracy      0.8610354
Precision     0.8768328
Sensitivity   0.9707792
Specificity   0.2881356
F1 score      0.9214176
AUC           0.6294574

The xgb.plot.shap output above shows the five features with the greatest impact on attrition. The MonthlyIncome panel clearly shows that a person with a high income is less likely to leave the company than a person with a low income.
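The plot referred to above can be produced with xgboost's xgb.plot.shap; the call below is a sketch that reuses the assumed names from the XGBoost sketch, with top_n = 5 taken from the text.

xgb.plot.shap(data = train_x, model = model4, top_n = 5)   # per-feature SHAP plots for the top 5 features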

Model 5 - Support Vector Machines

A technique that’s typically used for classification but can be transformed to perform regression. It draws a division between classes that’s as wise as possible

Metric        Value
Accuracy      0.8746594
Precision     0.8922156
Sensitivity   0.9675325
Specificity   0.3898305
F1 score      0.9283489
AUC           0.6786815

Model 6 - Decision Tree

Decision trees are a type of supervised machine learning (you supply the inputs and the corresponding outputs in the training data) in which the data is repeatedly split according to a chosen parameter.

library(rpart)                                              # recursive partitioning trees
model6 <- rpart(formula = Attrition ~ ., data = hr_train)   # fit a classification tree on the training set
plot(model6)                                                # draw the tree structure
text(model6)                                                # label the splits

Metric        Value
Accuracy      0.8256131
Precision     0.8588235
Sensitivity   0.9480519
Specificity   0.1864407
F1 score      0.9012346
AUC           0.5672463

To compare the metrics from all models, we add them to a single data frame (sketched below).
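As a sketch, the values reported above can be gathered into one data frame; the numbers are copied from the metric tables, and the object name is arbitrary.

# Collect per-model metrics for a side-by-side comparison (values copied from the tables above).
model_compare <- data.frame(
  Model    = c("Logistic (all)", "Logistic (significant)", "Random Forest",
               "XGBoost", "SVM", "Decision Tree"),
  Accuracy = c(0.8665, 0.8638, 0.8529, 0.8610, 0.8747, 0.8256),
  AUC      = c(0.6601, 0.6379, 0.5903, 0.6295, 0.6787, 0.5672))
model_compare[order(-model_compare$AUC), ]   # rank the models by AUC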

Among all 6 models, Model 1 performs well, with high accuracy and a high AUC.

Summary

From what we have seen so far, we have been able to come to the following conclusions:

  • Employees tend to stay if they have a higher income or more working years at the company
  • Employees who stay have received promotions or are satisfied with their ratings
  • Employees who have stock options are more likely to stay
  • Employees are more likely to leave if they work overtime

The logistic regression models provide better results than the other models.

References

XGBoost: https://www.youtube.com/watch?v=frCu6eSI8R0

SVM: https://www.datacamp.com/community/tutorials/support-vector-machines-r

Random Forest: https://towardsdatascience.com/random-forest-in-r-f66adf80ec9