HR analytics uncovers people-related trends in data and helps the HR department take appropriate steps to keep the organization running smoothly and profitably. Attrition in a corporate setup is one of the complex challenges that people managers and HR personnel have to deal with.
In this research assignment, we investigate a company's employee attrition data. The data set is fictional, originally created by IBM data scientists to uncover the factors that lead to employee attrition and to explore questions such as which factors most influence attrition among employees. We obtained it from Kaggle (https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset); the original data set can also be accessed at https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/. For convenience of analysis, it was saved as a .csv file in our group's GitHub repository. The attrition data set has 1470 observations with 35 variables, among them the target variable Attrition with possible outcomes "Yes" and "No". We analyze attrition with respect to gender, education, income, and working environment, and finally build predictive models to determine whether an employee is going to quit or not.
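Before the analysis we check the data for missing values and duplicates. A minimal sketch, assuming the data frame loaded from the .csv is named `hr_data` (the name is an assumption):

```r
# Count missing values per column; the output below shows zero everywhere
colSums(is.na(hr_data))

# Check for duplicate rows
sum(duplicated(hr_data))
```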
## Age Attrition BusinessTravel
## 0 0 0
## DailyRate Department DistanceFromHome
## 0 0 0
## Education EducationField EmployeeCount
## 0 0 0
## EmployeeNumber EnvironmentSatisfaction Gender
## 0 0 0
## HourlyRate JobInvolvement JobLevel
## 0 0 0
## JobRole JobSatisfaction MaritalStatus
## 0 0 0
## MonthlyIncome MonthlyRate NumCompaniesWorked
## 0 0 0
## Over18 OverTime PercentSalaryHike
## 0 0 0
## PerformanceRating RelationshipSatisfaction StandardHours
## 0 0 0
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear
## 0 0 0
## WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 0 0 0
## YearsSinceLastPromotion YearsWithCurrManager
## 0 0
## Data Set has 1470 Rows and 31 Columns
Fortunately, the data contain no missing values and no duplicate rows.
Also, some attributes that are categorical in nature are represented as integers in the data set; we convert them to categorical (factor) variables.
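A minimal sketch of the conversion; the exact set of integer-coded ordinal columns is an assumption:

```r
# Convert integer-coded ordinal attributes to factors (column set assumed)
cat_cols <- c("Education", "EnvironmentSatisfaction", "JobInvolvement",
              "JobLevel", "JobSatisfaction", "PerformanceRating",
              "RelationshipSatisfaction", "StockOptionLevel", "WorkLifeBalance")
hr_data[cat_cols] <- lapply(hr_data[cat_cols], as.factor)
```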
In this section, we visualize the influence of each variable on attrition in the organization.
20. Over Time: A larger proportion of employees working overtime are quitting.
21. Percent Salary Hike: People with less than a 15% hike are more likely to leave.
22. Performance Rating: 1 'Low', 2 'Good', 3 'Excellent', 4 'Outstanding'. Only employees with ratings of 3 and 4 are present; a smaller proportion of the 4-rated employees quit.
23. Relationship Satisfaction: 1 'Low', 2 'Medium', 3 'High', 4 'Very High'. A higher number of people with a rating of 3 or more are quitting, and there is a considerable number of employees with low and medium relationship satisfaction in this organization.
24. Stock Option Level: Larger proportions of levels 1 and 2 tend to quit.
25. Total Working Years: Larger proportions of people with 1 year of experience, and more broadly those in the 1-10 year bracket, quit the organization. The more experience you have, the more likely you are to stay.
26. Training Times Last Year: This indicates the number of training interventions the employee has attended. People who have been trained 2-4 times are an area of concern.
27. Work Life Balance: Ratings per the metadata are 1 'Bad', 2 'Good', 3 'Better', 4 'Best'. As expected, a larger proportion of employees with a rating of 1 quit, but in absolute numbers rating 3 is on the higher side.
The plot below shows correlations between variables: for example, OverTime is positively correlated with Attrition, while MonthlyIncome is negatively correlated.
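A sketch of how such a plot can be produced, assuming the corrplot package and a numeric-only view of the data (both assumptions):

```r
# Correlation matrix of the numeric columns, visualized with corrplot
library(corrplot)
num_data <- hr_data[sapply(hr_data, is.numeric)]
corrplot(cor(num_data), method = "circle", type = "upper", tl.cex = 0.6)
```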
To make the data fully numeric for modeling, we apply the following changes to some attributes.
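For example, binary factors can be recoded as 0/1 and multi-level factors as integer codes; a minimal sketch (the exact recodings are assumptions):

```r
# Recode binary factors as integers (mappings assumed)
hr_data$Attrition <- ifelse(hr_data$Attrition == "Yes", 1, 0)
hr_data$OverTime  <- ifelse(hr_data$OverTime  == "Yes", 1, 0)
hr_data$Gender    <- ifelse(hr_data$Gender == "Male", 1, 0)
# Map multi-level factors to integer codes
hr_data$BusinessTravel <- as.integer(as.factor(hr_data$BusinessTravel))
```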
Let's split the data into two parts, train and test: train contains 75% of the data and test contains 25%. The train data has 1103 rows and 31 columns; the test data has 367 rows and 31 columns.
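A minimal sketch of the split; the use of caret::createDataPartition and the seed are assumptions (any stratified 75/25 split gives similar sizes):

```r
# 75/25 train/test split (sampling method and seed assumed)
library(caret)
set.seed(123)
idx      <- createDataPartition(hr_data$Attrition, p = 0.75, list = FALSE)
hr_train <- hr_data[idx, ]   # 1103 rows
hr_test  <- hr_data[-idx, ]  # 367 rows
```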
We first fit a logistic regression model with all variables.
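The fit that produced the summary below:

```r
# Logistic regression on all predictors (matches the Call in the output)
model1 <- glm(Attrition ~ ., family = binomial(link = "logit"), data = hr_train)
summary(model1)
```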
##
## Call:
## glm(formula = Attrition ~ ., family = binomial(link = "logit"),
## data = hr_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9441 -0.5282 -0.3031 -0.1262 3.4353
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.341e+00 1.281e+00 4.950 7.41e-07 ***
## Age -3.520e-02 1.507e-02 -2.336 0.019493 *
## BusinessTravel -8.743e-02 1.485e-01 -0.589 0.555968
## DailyRate -4.530e-04 2.398e-04 -1.889 0.058871 .
## Department 6.392e-01 2.827e-01 2.261 0.023753 *
## DistanceFromHome 3.301e-02 1.182e-02 2.794 0.005207 **
## Education 2.208e-02 9.474e-02 0.233 0.815682
## EducationField 5.396e-02 7.237e-02 0.746 0.455919
## EnvironmentSatisfaction -3.548e-01 9.110e-02 -3.895 9.83e-05 ***
## Gender -4.557e-01 2.081e-01 -2.190 0.028540 *
## HourlyRate -5.238e-03 4.833e-03 -1.084 0.278444
## JobInvolvement -5.471e-01 1.385e-01 -3.951 7.80e-05 ***
## JobLevel -2.766e-01 3.248e-01 -0.852 0.394434
## JobRole -5.732e-02 5.800e-02 -0.988 0.322994
## JobSatisfaction -3.153e-01 9.040e-02 -3.488 0.000487 ***
## MaritalStatus -1.773e-01 1.274e-01 -1.391 0.164107
## MonthlyIncome -1.905e-05 7.851e-05 -0.243 0.808319
## MonthlyRate 1.121e-05 1.394e-05 0.804 0.421443
## NumCompaniesWorked 1.993e-01 4.296e-02 4.640 3.48e-06 ***
## OverTime 1.794e+00 2.091e-01 8.582 < 2e-16 ***
## PercentSalaryHike -8.461e-03 4.311e-02 -0.196 0.844410
## PerformanceRating -1.744e-01 4.396e-01 -0.397 0.691608
## RelationshipSatisfaction -3.116e-01 9.069e-02 -3.436 0.000590 ***
## StockOptionLevel -4.925e-01 1.249e-01 -3.943 8.03e-05 ***
## TotalWorkingYears -8.091e-02 3.269e-02 -2.475 0.013324 *
## TrainingTimesLastYear -1.761e-01 7.839e-02 -2.247 0.024665 *
## WorkLifeBalance -2.937e-01 1.338e-01 -2.195 0.028148 *
## YearsAtCompany 4.248e-02 4.476e-02 0.949 0.342579
## YearsInCurrentRole -1.561e-01 5.090e-02 -3.066 0.002169 **
## YearsSinceLastPromotion 1.945e-01 4.634e-02 4.198 2.69e-05 ***
## YearsWithCurrManager -3.895e-02 5.112e-02 -0.762 0.446113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 974.94 on 1102 degrees of freedom
## Residual deviance: 703.79 on 1072 degrees of freedom
## AIC: 765.79
##
## Number of Fisher Scoring iterations: 6
| Metric | Value |
|---|---|
| Accuracy | 0.8664850 |
| Precision | 0.8865672 |
| Sensitivity | 0.9642857 |
| Specificity | 0.3559322 |
| F1 score | 0.9237947 |
| AUC | 0.6601090 |
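A minimal sketch of how these metrics can be computed, assuming a 0.5 probability cutoff and treating "No" (coded 0) as the positive class, which matches the high sensitivity above (both assumptions):

```r
# Predict on the test set and build a confusion matrix (0.5 cutoff assumed)
pred_prob  <- predict(model1, newdata = hr_test, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)
cm <- table(Predicted = pred_class, Actual = hr_test$Attrition)

accuracy    <- sum(diag(cm)) / sum(cm)
precision   <- cm["0", "0"] / sum(cm["0", ])  # "No" treated as positive class
sensitivity <- cm["0", "0"] / sum(cm[, "0"])
specificity <- cm["1", "1"] / sum(cm[, "1"])
f1_score    <- 2 * precision * sensitivity / (precision + sensitivity)

# AUC via the pROC package (an assumption)
library(pROC)
auc(hr_test$Attrition, pred_prob)
```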
There are some insignificant variables present in Model 1, so in the next model we remove them.
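The reduced fit that produced the summary below:

```r
# Refit without the insignificant predictors (matches the Call in the output)
model2 <- glm(Attrition ~ . - BusinessTravel - Department - Education -
                EducationField - Gender - HourlyRate - JobLevel - JobRole -
                MaritalStatus - MonthlyIncome - MonthlyRate - PercentSalaryHike -
                PerformanceRating - TotalWorkingYears - TrainingTimesLastYear -
                YearsAtCompany,
              family = binomial(link = "logit"), data = hr_train)
summary(model2)
```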
##
## Call:
## glm(formula = Attrition ~ . - BusinessTravel - Department - Education -
## EducationField - Gender - HourlyRate - JobLevel - JobRole -
## MaritalStatus - MonthlyIncome - MonthlyRate - PercentSalaryHike -
## PerformanceRating - TotalWorkingYears - TrainingTimesLastYear -
## YearsAtCompany, family = binomial(link = "logit"), data = hr_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0635 -0.5549 -0.3356 -0.1744 3.2644
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.5567355 0.8172084 6.800 1.05e-11 ***
## Age -0.0728333 0.0125475 -5.805 6.45e-09 ***
## DailyRate -0.0004540 0.0002331 -1.948 0.051429 .
## DistanceFromHome 0.0312206 0.0113019 2.762 0.005738 **
## EnvironmentSatisfaction -0.3058260 0.0871700 -3.508 0.000451 ***
## JobInvolvement -0.5259334 0.1326082 -3.966 7.31e-05 ***
## JobSatisfaction -0.2630740 0.0856637 -3.071 0.002133 **
## NumCompaniesWorked 0.1619426 0.0397724 4.072 4.67e-05 ***
## OverTime 1.6906651 0.1975211 8.559 < 2e-16 ***
## RelationshipSatisfaction -0.2730902 0.0868695 -3.144 0.001668 **
## StockOptionLevel -0.4376040 0.1204280 -3.634 0.000279 ***
## WorkLifeBalance -0.3230708 0.1282947 -2.518 0.011796 *
## YearsInCurrentRole -0.1591994 0.0447435 -3.558 0.000374 ***
## YearsSinceLastPromotion 0.1667490 0.0400291 4.166 3.10e-05 ***
## YearsWithCurrManager -0.0582141 0.0424817 -1.370 0.170583
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 974.94 on 1102 degrees of freedom
## Residual deviance: 744.14 on 1088 degrees of freedom
## AIC: 774.14
##
## Number of Fisher Scoring iterations: 6
| Metric | Value |
|---|---|
| Accuracy | 0.8637602 |
| Precision | 0.8794118 |
| Sensitivity | 0.9707792 |
| Specificity | 0.3050847 |
| F1 score | 0.9228395 |
| AUC | 0.6379320 |
Random forest is a supervised learning algorithm. The "forest" it builds is an ensemble of decision trees, usually trained with the "bagging" method. The general idea of bagging is that a combination of learning models improves the overall result.
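A minimal sketch of the fit, assuming the randomForest package; the seed and tree count are assumptions:

```r
# Random forest classifier (hyperparameters assumed)
library(randomForest)
set.seed(123)
rf_model <- randomForest(as.factor(Attrition) ~ ., data = hr_train, ntree = 500)
rf_prob  <- predict(rf_model, newdata = hr_test, type = "prob")[, "1"]
```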
| Metric | Value |
|---|---|
| Accuracy | 0.8528610 |
| Precision | 0.8649425 |
| Sensitivity | 0.9772727 |
| Specificity | 0.2033898 |
| F1 score | 0.9176829 |
| AUC | 0.5903313 |
XGBoost is a classification or regression technique that generates decision trees sequentially, where each tree focuses on correcting the errors of the previous one. The final output is a combination of the results from all trees.
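A minimal sketch of the fit, assuming the xgboost package; the hyperparameters are assumptions, though the log below confirms 20 boosting rounds evaluated with AUC:

```r
# Gradient-boosted trees via xgboost (hyperparameters assumed)
library(xgboost)
X_train <- as.matrix(hr_train[, setdiff(names(hr_train), "Attrition")])
dtrain  <- xgb.DMatrix(data = X_train, label = hr_train$Attrition)
xgb_model <- xgb.train(params = list(objective = "binary:logistic",
                                     eval_metric = "auc"),
                       data = dtrain, nrounds = 20,
                       watchlist = list(train = dtrain))
```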
## [21:46:19] WARNING: amalgamation/../src/learner.cc:516:
## Parameters: { set_seed } might not be used.
##
## This may not be accurate due to some parameters are only used in language bindings but
## passed down to XGBoost core. Or some parameters are not used but slip through this
## verification. Please open an issue if you find above cases.
##
##
## [1] train-auc:0.828813
## [2] train-auc:0.896875
## [3] train-auc:0.917373
## [4] train-auc:0.957574
## [5] train-auc:0.963365
## [6] train-auc:0.973456
## [7] train-auc:0.986195
## [8] train-auc:0.989371
## [9] train-auc:0.992609
## [10] train-auc:0.995153
## [11] train-auc:0.997243
## [12] train-auc:0.998433
## [13] train-auc:0.999247
## [14] train-auc:0.999338
## [15] train-auc:0.999472
## [16] train-auc:0.999836
## [17] train-auc:0.999885
## [18] train-auc:0.999958
## [19] train-auc:0.999994
## [20] train-auc:1.000000
| Metric | Value |
|---|---|
| Accuracy | 0.8610354 |
| Precision | 0.8768328 |
| Sensitivity | 0.9707792 |
| Specificity | 0.2881356 |
| F1 score | 0.9214176 |
| AUC | 0.6294574 |
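The SHAP plot discussed next can be produced with xgb.plot.shap; a minimal sketch (the feature-matrix name and top_n value are assumptions):

```r
# SHAP contribution plot for the five most important features
xgb.plot.shap(data = X_train, model = xgb_model, top_n = 5)
```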
The xgb.plot.shap output above shows the top 5 features with the most impact on attrition. The MonthlyIncome panel clearly shows that a person with a high income is less likely to leave the company than a person with a low income.
A technique that’s typically used for classification but can be transformed to perform regression. It draws a division between classes that’s as wise as possible
| Metric | Value |
|---|---|
| Accuracy | 0.8746594 |
| Precision | 0.8922156 |
| Sensitivity | 0.9675325 |
| Specificity | 0.3898305 |
| F1 score | 0.9283489 |
| AUC | 0.6786815 |
Decision trees are a type of supervised machine learning (that is, you provide the input and the corresponding output in the training data) where the data is continuously split according to certain parameters.
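A minimal sketch of the fit, assuming the rpart package:

```r
# Single classification tree (rpart assumed)
library(rpart)
dt_model <- rpart(as.factor(Attrition) ~ ., data = hr_train, method = "class")
dt_pred  <- predict(dt_model, newdata = hr_test, type = "class")
```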
| Metric | Value |
|---|---|
| Accuracy | 0.8256131 |
| Precision | 0.8588235 |
| Sensitivity | 0.9480519 |
| Specificity | 0.1864407 |
| F1 score | 0.9012346 |
| AUC | 0.5672463 |
To compare the metrics from all models, we combine them into a single data frame.
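A minimal sketch of the comparison, using the accuracy and AUC values reported above:

```r
# Collect headline metrics from all six models into one data frame
model_metrics <- data.frame(
  Model    = c("Logistic (all)", "Logistic (reduced)", "Random Forest",
               "XGBoost", "SVM", "Decision Tree"),
  Accuracy = c(0.8664850, 0.8637602, 0.8528610, 0.8610354, 0.8746594, 0.8256131),
  AUC      = c(0.6601090, 0.6379320, 0.5903313, 0.6294574, 0.6786815, 0.5672463)
)
model_metrics
```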
Among all 6 models, Model 1 performs well, with high accuracy and high AUC.
From what we have seen so far, we have been able to come to the following conclusions:
The logistic regression models provide better results than the other models.
References:
- XGBoost: https://www.youtube.com/watch?v=frCu6eSI8R0
- SVM: https://www.datacamp.com/community/tutorials/support-vector-machines-r
- Random Forest: https://towardsdatascience.com/random-forest-in-r-f66adf80ec9