The Data

This sample data set is provided by IBM as part of the Watson Analytics sample data; it can be obtained from the following URL:

https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/

Last accessed on 03/28/2019

First look at the data

str(employees)
## 'data.frame':    1470 obs. of  35 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeNumber          : int  1 2 4 5 7 8 10 11 12 13 ...
##  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ Over18                  : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : int  3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...

From this output we can see that some variables can be removed because they carry no useful information (they are either constant or serve only as identifiers):

# drop constant and identifier-only columns (select is from dplyr)
employees = select(employees, -c(Over18, StandardHours, EmployeeCount, EmployeeNumber))

There are other variables whose meaning we couldn't determine, so we remove them as well:

employees = select(employees, -c(HourlyRate, MonthlyRate, DailyRate))

Data cleaning and preparation

We start by assigning the proper type to factor variables:

# convert the ordinal / categorical integer columns to factors
factorCols = c("WorkLifeBalance", "JobSatisfaction", "StockOptionLevel",
               "RelationshipSatisfaction", "PerformanceRating", "JobLevel",
               "JobInvolvement", "EnvironmentSatisfaction", "Education",
               "EducationField")
employees[factorCols] = lapply(employees[factorCols], as.factor)

Exploratory Data Analysis

The dependent variable shows a pronounced class imbalance that might affect our analysis and will need to be addressed to achieve a good model:
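
A quick tabulation makes the imbalance visible; roughly 84% of the employees are non-attrition cases, which matches the prevalence of 0.8408 reported by the confusion matrices later on:

table(employees$Attrition)
prop.table(table(employees$Attrition))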

Since the data set contains several variable types, we explore them by type, as each type may require different considerations.

Nominal Variables

employeeNom = 
  select(employees, 
          c(Attrition, Department, EducationField, Gender, JobRole, MaritalStatus, OverTime))

First we check the relationship between each independent variable and the dependent variable; for this we run a chi-squared test for each variable in this section:
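
The exact test calls aren't shown in the output; a minimal sketch, assuming base R's chisq.test:

# chi-squared test of independence between each nominal predictor and Attrition
for (v in setdiff(names(employeeNom), "Attrition")) {
  test = chisq.test(table(employeeNom[[v]], employeeNom$Attrition))
  cat(v, ": X-squared =", round(test$statistic, 2),
      ", p-value =", format.pval(test$p.value), "\n")
}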

The chi-squared tests give us a first idea of which variables are associated with the dependent variable. Let's try to measure the strength of each association using Goodman and Kruskal's tau:
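
A minimal sketch of this measurement, assuming the GoodmanKruskal package from CRAN:

library(GoodmanKruskal)
# pairwise Goodman and Kruskal's tau over the nominal variables; tau is asymmetric,
# so each cell reads "how well does the row variable predict the column variable"
gkMatrix = GKtauDataframe(employeeNom)
plot(gkMatrix)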

We see that JobRole, Department and EducationField are clearly related, but the strength of the association is weaker than our previous tests would suggest; these results might be affected by the class imbalance we detected earlier.

Ordinal Variables

First we'll assume each ordinal variable has an underlying numerical value, and that the distance between consecutive ordinal values is the same; we'll support these assumptions by looking at a best-fit line for each of these variables (see the sketch after the code below):

employeeOrd = 
  select(employees, 
         c(Attrition, BusinessTravel, Education
           , EnvironmentSatisfaction, JobInvolvement
           , JobLevel, PerformanceRating, RelationshipSatisfaction
           , StockOptionLevel, WorkLifeBalance))
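
One way to produce those best-fit lines, assuming ggplot2 (the original plots aren't reproduced here): treat each ordinal level as its numeric code, compute the attrition rate per level, and fit a line through the rates. An approximately linear trend supports the equal-distance assumption.

library(ggplot2)
# attrition rate per level of an ordinal variable, with a linear best-fit line
plotOrdinal = function(varName) {
  rates = aggregate(employees$Attrition == "Yes",
                    by = list(level = as.numeric(employees[[varName]])), FUN = mean)
  ggplot(rates, aes(x = level, y = x)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(title = varName, x = "Level", y = "Attrition rate")
}
plotOrdinal("JobInvolvement")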

Now, with reasonable confidence, we can use chi-squared tests to check the relation between the independent ordinal variables and the dependent variable, as they seem to have a roughly linear relationship with Attrition:
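
These tests follow the same pattern as for the nominal variables; a sketch:

# chi-squared tests between each ordinal predictor and Attrition
for (v in setdiff(names(employeeOrd), "Attrition")) {
  p = chisq.test(table(employeeOrd[[v]], employeeOrd$Attrition))$p.value
  cat(v, ": p-value =", format.pval(p), "\n")
}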

From the p-values obtained for each test, we can't find evidence of a relation between RelationshipSatisfaction, PerformanceRating or Education and Attrition.

Continuous Variables

For the continuous variables, a line of best fit suggests that most of them have some kind of relation to the dependent variable:

We'll use a Pearson correlation test against the binary dependent variable to check each relationship in detail:
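
A sketch of these tests (with a binary response this is the point-biserial correlation, computed with base R's cor.test; the list of continuous columns is taken from the structure output above):

# Pearson correlation between each remaining integer column and Attrition as 0/1
attritionNum = ifelse(employees$Attrition == "No", 0, 1)
continuousVars = c("Age", "DistanceFromHome", "MonthlyIncome", "NumCompaniesWorked",
                   "PercentSalaryHike", "TotalWorkingYears", "TrainingTimesLastYear",
                   "YearsAtCompany", "YearsInCurrentRole", "YearsSinceLastPromotion",
                   "YearsWithCurrManager")
for (v in continuousVars) {
  test = cor.test(employees[[v]], attritionNum, method = "pearson")
  cat(v, ": r =", round(unname(test$estimate), 3),
      ", p-value =", format.pval(test$p.value), "\n")
}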

Test Conclusion

These test results are odd, considering that some variables that should have an impact on attrition show no evidence of a relation (e.g. PerformanceRating, Education, PercentSalaryHike). In the next section we'll build some models with this data set, but it seems better data is needed to achieve a useful prediction.

Model Building

Now we’ll evaluate some models to find a good fit. We start by setting our seed and splitting the data into training (70%) and test (30%) sets.

set.seed(42)
# sample.split (from caTools) expects the label vector, so we split on Attrition
sampleValue = sample.split(employees$Attrition, SplitRatio = 0.7)
employeeTrain = subset(employees, sampleValue == TRUE)
employeeTest = subset(employees, sampleValue == FALSE)

Logistic Regression

Our first approach is to apply logistic regression, setting our threshold to 0.5:

logisticRegression <- glm(Attrition ~ ., family = binomial(link = "logit"), data = employeeTrain)

# encode the test response as 0/1 to match the predicted classes
employeeTest = employeeTest %>% mutate(Attrition = ifelse(Attrition == "No", 0, 1))
logitPrediction <- predict(logisticRegression, newdata = employeeTest, type = "response")
logitPrediction <- ifelse(logitPrediction > 0.5, 1, 0)

Let’s check the resulting confusion matrix to evaluate the performance of this first model:
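
The matrix below was presumably produced along these lines, with confusionMatrix from the caret package (the same call appears explicitly in the random forest section):

library(caret)
confusionMatrix(table(logitPrediction, employeeTest$Attrition))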

## Confusion Matrix and Statistics
## 
##                
## logitPrediction   0   1
##               0 378  31
##               1  18  44
##                                          
##                Accuracy : 0.896          
##                  95% CI : (0.8648, 0.922)
##     No Information Rate : 0.8408         
##     P-Value [Acc > NIR] : 0.0003742      
##                                          
##                   Kappa : 0.5821         
##                                          
##  Mcnemar's Test P-Value : 0.0864763      
##                                          
##             Sensitivity : 0.9545         
##             Specificity : 0.5867         
##          Pos Pred Value : 0.9242         
##          Neg Pred Value : 0.7097         
##              Prevalence : 0.8408         
##          Detection Rate : 0.8025         
##    Detection Prevalence : 0.8684         
##       Balanced Accuracy : 0.7706         
##                                          
##        'Positive' Class : 0              
## 

We got good accuracy, but a considerable number of attrition cases are missed, and the No Information Rate is high: our model is not much better than simply predicting every case as non-attrition (0). The low Kappa value points in the same direction.

The model output also confirms some of our exploratory findings about which variables have the most impact on the dependent variable; however, there are some discrepancies worth studying, since the tests we performed gave different results than the model does:

## 
## Call:
## glm(formula = Attrition ~ ., family = binomial(link = "logit"), 
##     data = employeeTrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8938  -0.4729  -0.2107  -0.0557   3.5400  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -1.121e+01  6.868e+02  -0.016 0.986978
## Age                              -4.343e-02  1.712e-02  -2.537 0.011180
## BusinessTravelTravel_Frequently   2.148e+00  5.260e-01   4.085 4.41e-05
## BusinessTravelTravel_Rarely       1.319e+00  4.848e-01   2.720 0.006530
## DepartmentResearch & Development  1.366e+01  6.868e+02   0.020 0.984136
## DepartmentSales                   1.341e+01  6.868e+02   0.020 0.984426
## DistanceFromHome                  5.863e-02  1.388e-02   4.225 2.39e-05
## Education2                        5.376e-01  4.201e-01   1.280 0.200664
## Education3                        4.980e-01  3.729e-01   1.335 0.181718
## Education4                        6.227e-01  4.103e-01   1.518 0.129089
## Education5                        1.225e+00  7.348e-01   1.667 0.095448
## EducationFieldLife Sciences      -8.107e-01  9.711e-01  -0.835 0.403795
## EducationFieldMarketing          -2.569e-01  1.033e+00  -0.249 0.803629
## EducationFieldMedical            -6.303e-01  9.679e-01  -0.651 0.514901
## EducationFieldOther              -2.928e-01  1.063e+00  -0.275 0.783072
## EducationFieldTechnical Degree    4.280e-01  9.905e-01   0.432 0.665670
## EnvironmentSatisfaction2         -1.047e+00  3.511e-01  -2.980 0.002878
## EnvironmentSatisfaction3         -1.023e+00  3.207e-01  -3.192 0.001414
## EnvironmentSatisfaction4         -1.269e+00  3.315e-01  -3.828 0.000129
## GenderMale                        4.207e-01  2.337e-01   1.801 0.071770
## JobInvolvement2                  -1.157e+00  4.734e-01  -2.445 0.014483
## JobInvolvement3                  -1.430e+00  4.567e-01  -3.132 0.001739
## JobInvolvement4                  -1.877e+00  5.815e-01  -3.228 0.001248
## JobLevel2                        -1.750e+00  5.616e-01  -3.116 0.001832
## JobLevel3                        -2.202e-01  8.707e-01  -0.253 0.800344
## JobLevel4                        -2.353e+00  1.556e+00  -1.512 0.130422
## JobLevel5                         1.045e+00  2.009e+00   0.520 0.603106
## JobRoleHuman Resources            1.464e+01  6.868e+02   0.021 0.982996
## JobRoleLaboratory Technician      6.583e-01  7.267e-01   0.906 0.364990
## JobRoleManager                   -1.219e+00  1.477e+00  -0.825 0.409356
## JobRoleManufacturing Director     3.746e-01  6.596e-01   0.568 0.570079
## JobRoleResearch Director         -2.131e+00  1.411e+00  -1.510 0.130960
## JobRoleResearch Scientist        -6.794e-02  7.306e-01  -0.093 0.925913
## JobRoleSales Executive            1.723e+00  1.480e+00   1.164 0.244351
## JobRoleSales Representative       1.347e+00  1.579e+00   0.853 0.393837
## JobSatisfaction2                 -6.924e-01  3.367e-01  -2.056 0.039746
## JobSatisfaction3                 -7.032e-01  3.015e-01  -2.332 0.019692
## JobSatisfaction4                 -1.269e+00  3.150e-01  -4.029 5.60e-05
## MaritalStatusMarried              5.954e-01  3.559e-01   1.673 0.094341
## MaritalStatusSingle               1.089e+00  5.091e-01   2.140 0.032380
## MonthlyIncome                     7.345e-06  1.135e-04   0.065 0.948388
## NumCompaniesWorked                1.844e-01  5.009e-02   3.681 0.000232
## OverTimeYes                       2.016e+00  2.534e-01   7.954 1.81e-15
## PercentSalaryHike                -3.372e-02  5.048e-02  -0.668 0.504163
## PerformanceRating4                4.488e-01  5.057e-01   0.887 0.374836
## RelationshipSatisfaction2        -6.298e-01  3.519e-01  -1.790 0.073496
## RelationshipSatisfaction3        -9.386e-01  3.247e-01  -2.891 0.003840
## RelationshipSatisfaction4        -9.160e-01  3.218e-01  -2.846 0.004425
## StockOptionLevel1                -8.466e-01  3.952e-01  -2.142 0.032156
## StockOptionLevel2                -9.129e-01  5.515e-01  -1.655 0.097851
## StockOptionLevel3                -4.230e-01  5.946e-01  -0.711 0.476855
## TotalWorkingYears                -4.746e-02  3.660e-02  -1.297 0.194707
## TrainingTimesLastYear            -2.139e-01  9.174e-02  -2.332 0.019697
## WorkLifeBalance2                 -9.323e-01  4.788e-01  -1.947 0.051524
## WorkLifeBalance3                 -1.540e+00  4.521e-01  -3.407 0.000657
## WorkLifeBalance4                 -1.667e+00  5.798e-01  -2.874 0.004047
## YearsAtCompany                    1.489e-01  4.741e-02   3.141 0.001685
## YearsInCurrentRole               -2.170e-01  6.163e-02  -3.520 0.000431
## YearsSinceLastPromotion           1.437e-01  5.269e-02   2.727 0.006391
## YearsWithCurrManager             -1.304e-01  5.915e-02  -2.205 0.027470
##                                     
## (Intercept)                         
## Age                              *  
## BusinessTravelTravel_Frequently  ***
## BusinessTravelTravel_Rarely      ** 
## DepartmentResearch & Development    
## DepartmentSales                     
## DistanceFromHome                 ***
## Education2                          
## Education3                          
## Education4                          
## Education5                       .  
## EducationFieldLife Sciences         
## EducationFieldMarketing             
## EducationFieldMedical               
## EducationFieldOther                 
## EducationFieldTechnical Degree      
## EnvironmentSatisfaction2         ** 
## EnvironmentSatisfaction3         ** 
## EnvironmentSatisfaction4         ***
## GenderMale                       .  
## JobInvolvement2                  *  
## JobInvolvement3                  ** 
## JobInvolvement4                  ** 
## JobLevel2                        ** 
## JobLevel3                           
## JobLevel4                           
## JobLevel5                           
## JobRoleHuman Resources              
## JobRoleLaboratory Technician        
## JobRoleManager                      
## JobRoleManufacturing Director       
## JobRoleResearch Director            
## JobRoleResearch Scientist           
## JobRoleSales Executive              
## JobRoleSales Representative         
## JobSatisfaction2                 *  
## JobSatisfaction3                 *  
## JobSatisfaction4                 ***
## MaritalStatusMarried             .  
## MaritalStatusSingle              *  
## MonthlyIncome                       
## NumCompaniesWorked               ***
## OverTimeYes                      ***
## PercentSalaryHike                   
## PerformanceRating4                  
## RelationshipSatisfaction2        .  
## RelationshipSatisfaction3        ** 
## RelationshipSatisfaction4        ** 
## StockOptionLevel1                *  
## StockOptionLevel2                .  
## StockOptionLevel3                   
## TotalWorkingYears                   
## TrainingTimesLastYear            *  
## WorkLifeBalance2                 .  
## WorkLifeBalance3                 ***
## WorkLifeBalance4                 ** 
## YearsAtCompany                   ** 
## YearsInCurrentRole               ***
## YearsSinceLastPromotion          ** 
## YearsWithCurrManager             *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 885.59  on 998  degrees of freedom
## Residual deviance: 557.09  on 939  degrees of freedom
## AIC: 677.09
## 
## Number of Fisher Scoring iterations: 15

Decision Tree

Let's now try a decision tree; this should give a model that is easier to explain in terms of how the variables and their values affect the dependent variable:
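
The fitting call can be read back from the printed summary below; a minimal sketch of the fit (printcp is assumed for the complexity table):

library(rpart)
# grow a deliberately deep tree (very small cp) and inspect the complexity table
tree = rpart(Attrition ~ ., data = employeeTrain, control = rpart.control(cp = 1e-04))
printcp(tree)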

## 
## Classification tree:
## rpart(formula = Attrition ~ ., data = employeeTrain, control = rpart.control(cp = 1e-04))
## 
## Variables actually used in tree construction:
##  [1] Age                      BusinessTravel          
##  [3] DistanceFromHome         Education               
##  [5] EducationField           EnvironmentSatisfaction 
##  [7] JobInvolvement           JobLevel                
##  [9] JobRole                  JobSatisfaction         
## [11] MonthlyIncome            OverTime                
## [13] RelationshipSatisfaction StockOptionLevel        
## [15] TotalWorkingYears        WorkLifeBalance         
## [17] YearsAtCompany           YearsSinceLastPromotion 
## 
## Root node error: 162/999 = 0.16216
## 
## n= 999 
## 
##          CP nsplit rel error  xerror     xstd
## 1 0.0524691      0   1.00000 1.00000 0.071915
## 2 0.0246914      2   0.89506 0.93827 0.070075
## 3 0.0185185      4   0.84568 0.93210 0.069886
## 4 0.0123457      7   0.79012 0.97531 0.071192
## 5 0.0046296     17   0.66049 1.00000 0.071915
## 6 0.0041152     21   0.64198 1.05556 0.073488
## 7 0.0020576     24   0.62963 1.06173 0.073658
## 8 0.0001000     27   0.62346 1.08025 0.074163

Confusion matrix for the decision tree:
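
A sketch of the prediction step, mirroring the 0/1 encoding used for the logistic model (the fitted tree object name is an assumption):

treePrediction = predict(tree, newdata = employeeTest, type = "class")
treePrediction = ifelse(treePrediction == "No", 0, 1)
confusionMatrix(table(treePrediction, employeeTest$Attrition))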

## Confusion Matrix and Statistics
## 
##               
## treePrediction   0   1
##              0 373  58
##              1  23  17
##                                          
##                Accuracy : 0.828          
##                  95% CI : (0.7909, 0.861)
##     No Information Rate : 0.8408         
##     P-Value [Acc > NIR] : 0.7948461      
##                                          
##                   Kappa : 0.2079         
##                                          
##  Mcnemar's Test P-Value : 0.0001582      
##                                          
##             Sensitivity : 0.9419         
##             Specificity : 0.2267         
##          Pos Pred Value : 0.8654         
##          Neg Pred Value : 0.4250         
##              Prevalence : 0.8408         
##          Detection Rate : 0.7919         
##    Detection Prevalence : 0.9151         
##       Balanced Accuracy : 0.5843         
##                                          
##        'Positive' Class : 0              
## 

Random Forest

Now we’ll try to get a better result with a random forest:

# fit a random forest (randomForest package)
forest = randomForest(Attrition ~ ., data = employeeTrain, ntree = 500, mtry = 10, importance = TRUE)

forestPrediction <- predict(forest, newdata = employeeTest, type = "class")

# encode the predictions as 0/1 to match the test response
forestPrediction = ifelse(forestPrediction == "No", 0, 1)
confusionMatrix(table(forestPrediction, employeeTest$Attrition))
## Confusion Matrix and Statistics
## 
##                 
## forestPrediction   0   1
##                0 394  62
##                1   2  13
##                                           
##                Accuracy : 0.8641          
##                  95% CI : (0.8298, 0.8938)
##     No Information Rate : 0.8408          
##     P-Value [Acc > NIR] : 0.09108         
##                                           
##                   Kappa : 0.249           
##                                           
##  Mcnemar's Test P-Value : 1.643e-13       
##                                           
##             Sensitivity : 0.9949          
##             Specificity : 0.1733          
##          Pos Pred Value : 0.8640          
##          Neg Pred Value : 0.8667          
##              Prevalence : 0.8408          
##          Detection Rate : 0.8365          
##    Detection Prevalence : 0.9682          
##       Balanced Accuracy : 0.5841          
##                                           
##        'Positive' Class : 0               
## 

We couldn't get better results using decision trees or random forests, for the same reasons; we must overcome the class imbalance to improve the prediction of attrition cases.

Treating the Class Imbalance

From the confusion matrix of each model we can see that the class imbalance is hindering our ability to predict attrition cases, which is the class we are actually interested in. We'll explore a couple of methods to improve our model.

Over-Sampling

We'll try to get more attrition cases by over-sampling them in the training set:
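
The original resampling call isn't shown; one possibility that reproduces the printed class counts, assuming ovun.sample from the ROSE package, is:

library(ROSE)
# replicate minority ("Yes") cases until the training set is roughly balanced
employeeTrainOver = ovun.sample(Attrition ~ ., data = employeeTrain,
                                method = "over", N = 1600)$data
table(employeeTrainOver$Attrition)

logisticRegression_over = glm(Attrition ~ ., family = binomial(link = "logit"),
                              data = employeeTrainOver)
logitPrediction_over = predict(logisticRegression_over, newdata = employeeTest,
                               type = "response")
logitPrediction_over = ifelse(logitPrediction_over > 0.5, 1, 0)
confusionMatrix(table(logitPrediction_over, employeeTest$Attrition))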

## 
##  No Yes 
## 837 763
## Confusion Matrix and Statistics
## 
##                     
## logitPrediction_over   0   1
##                    0 316  15
##                    1  80  60
##                                           
##                Accuracy : 0.7983          
##                  95% CI : (0.7592, 0.8336)
##     No Information Rate : 0.8408          
##     P-Value [Acc > NIR] : 0.994           
##                                           
##                   Kappa : 0.4425          
##                                           
##  Mcnemar's Test P-Value : 5.159e-11       
##                                           
##             Sensitivity : 0.7980          
##             Specificity : 0.8000          
##          Pos Pred Value : 0.9547          
##          Neg Pred Value : 0.4286          
##              Prevalence : 0.8408          
##          Detection Rate : 0.6709          
##    Detection Prevalence : 0.7028          
##       Balanced Accuracy : 0.7990          
##                                           
##        'Positive' Class : 0               
## 

Our model actually performs worse in this case, so let's try combining over-sampling with under-sampling:
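
The prediction object's name in the output (logitPrediction_smote) suggests SMOTE was used; a sketch assuming the DMwR package (since archived on CRAN), with perc.over and perc.under chosen to reproduce the printed class counts:

library(DMwR)
# SMOTE: synthesize minority cases (perc.over) and under-sample the majority (perc.under)
employeeTrainSmote = SMOTE(Attrition ~ ., data = employeeTrain,
                           perc.over = 300, perc.under = 370)
table(employeeTrainSmote$Attrition)

logisticRegression_smote = glm(Attrition ~ ., family = binomial(link = "logit"),
                               data = employeeTrainSmote)
logitPrediction_smote = predict(logisticRegression_smote, newdata = employeeTest,
                                type = "response")
logitPrediction_smote = ifelse(logitPrediction_smote > 0.5, 1, 0)
confusionMatrix(table(logitPrediction_smote, employeeTest$Attrition))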

## 
##   No  Yes 
## 1798  648
## Confusion Matrix and Statistics
## 
##                      
## logitPrediction_smote   0   1
##                     0 358  24
##                     1  38  51
##                                           
##                Accuracy : 0.8684          
##                  95% CI : (0.8344, 0.8976)
##     No Information Rate : 0.8408          
##     P-Value [Acc > NIR] : 0.05507         
##                                           
##                   Kappa : 0.543           
##                                           
##  Mcnemar's Test P-Value : 0.09874         
##                                           
##             Sensitivity : 0.9040          
##             Specificity : 0.6800          
##          Pos Pred Value : 0.9372          
##          Neg Pred Value : 0.5730          
##              Prevalence : 0.8408          
##          Detection Rate : 0.7601          
##    Detection Prevalence : 0.8110          
##       Balanced Accuracy : 0.7920          
##                                           
##        'Positive' Class : 0               
## 

We achieved a small improvement by balancing the classes, but not enough for a good prediction; moreover, the proportion of resampled cases might introduce effects we're not considering, such as over-generalizing the classes.

Setting the Threshold

Looking at the models we trained, sensitivity and specificity show a trade-off we can exploit to get the result we want, which is to identify attrition cases. Since mislabeling a non-attrition case doesn't seem harmful, we'll try lowering the threshold to 0.2 and see the results:

# same model as before; only the classification threshold changes
# (employeeTest$Attrition was already encoded as 0/1 above, so we don't re-encode it)
logisticRegression <- glm(Attrition ~ ., family = binomial(link = "logit"), data = employeeTrain)

logitPrediction <- predict(logisticRegression, newdata = employeeTest, type = "response")
logitPrediction <- ifelse(logitPrediction > 0.2, 1, 0)
## Confusion Matrix and Statistics
## 
##                
## logitPrediction   0   1
##               0 325  14
##               1  71  61
##                                           
##                Accuracy : 0.8195          
##                  95% CI : (0.7818, 0.8532)
##     No Information Rate : 0.8408          
##     P-Value [Acc > NIR] : 0.9053          
##                                           
##                   Kappa : 0.4847          
##                                           
##  Mcnemar's Test P-Value : 1.247e-09       
##                                           
##             Sensitivity : 0.8207          
##             Specificity : 0.8133          
##          Pos Pred Value : 0.9587          
##          Neg Pred Value : 0.4621          
##              Prevalence : 0.8408          
##          Detection Rate : 0.6900          
##    Detection Prevalence : 0.7197          
##       Balanced Accuracy : 0.8170          
##                                           
##        'Positive' Class : 0               
## 

Conclusions

With the data available at the moment, we weren't able to build a model that properly predicts employee attrition; as noted in the test conclusions, better data seems necessary to achieve a useful prediction.