Ask

Step 1: Identify Business Problems & Opportunities

Workforce lacks diversity
Wages are below the local average
Turnover needs to be reduced
Morale is low
Accidents need to be minimized if not eliminated
Production goals that have to be met

Step 2: Size and prioritize business opportunities

Sized all opportunities…and reducing turnover seems to be worth pursuing as priority one.

Acquire

Step 3: Data discovery and acquisition

Let’s suppose that someone else has organized a dataset that allows us to examine turnover drivers.

# readxl allows excel files to be read.
library(readxl)

## Warning: package 'readxl' was built under R version 3.5.3

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.5.3

## -- Attaching packages ------------- tidyverse 1.2.1 --

## v ggplot2 3.1.0       v purrr   0.3.1  
## v tibble  2.0.1       v dplyr   0.8.0.1
## v tidyr   0.8.3       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0

## Warning: package 'ggplot2' was built under R version 3.5.3

## Warning: package 'tibble' was built under R version 3.5.3

## Warning: package 'tidyr' was built under R version 3.5.3

## Warning: package 'readr' was built under R version 3.5.3

## Warning: package 'purrr' was built under R version 3.5.3

## Warning: package 'dplyr' was built under R version 3.5.3

## Warning: package 'stringr' was built under R version 3.5.3

## Warning: package 'forcats' was built under R version 3.5.3

## -- Conflicts ---------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggplot2)
df <- read_excel("WA_Fn-UseC_-HR-Employee-Attrition.xlsx")

We have 1470 rows (observations) and 35 variables.
The “Attrition” column is our target (outcome variable or dependent variable), but it is a character variable. We need to convert it to numeric. The mutate function in the tidyverse will accomplish this by creating a new variable (“attrition”).

Step 4: Exploratory data analysis

Let’s make “Attrition” into a numerical variable and…

attrition <- df %>%
  mutate(attrition = ifelse(Attrition == "Yes", 1, 0))

…examine the data via plots to see if there are any trends. Here is a plot of data showing a possible association between department and attrition.

d <- attrition %>%
mutate(attrition = ifelse(Attrition == "Yes", 1, 0)) %>%
group_by(Department) %>%
summarize(attrition_rate = mean(attrition)) 
ggplot(d, aes(x = Department, y = attrition_rate)) + geom_bar(stat = "identity") + ggtitle("Turnover Rates Per Department") + coord_flip()

## Warning: Removed 1 rows containing missing values (position_stack).

Below is a plot of attrition by job roles.

d <- df %>%
mutate(attrition = ifelse(Attrition == "Yes", 1, 0)) %>%
group_by(JobRole) %>%
summarize(attrition_rate = mean(attrition)) 
ggplot(d, aes(x = JobRole, y = attrition_rate)) + geom_bar(stat = "identity") + ggtitle("Turnover Rates Per Job Role") + coord_flip()

## Warning: Removed 1 rows containing missing values (position_stack).

We can see that there are some jobs (e.g., Sales roles, HR) that seem to produce a lot of turnover. Other roles (e.g., being a research director) seem more associated with retention. However, we need to test these differences (and others) statistically, which brings us to step 5.

Step 5: Hypothesis Testing and Modeling

Confirmatory approach

Going with one variable at a time drawing on prior literature and intuition in picking the variables.

# H1: Department correlates with turnover.
H1 <- glm(attrition ~ Department, family=binomial(link='logit'), attrition)
summary(H1)

## 
## Call:
## glm(formula = attrition ~ Department, family = binomial(link = "logit"), 
##     data = attrition)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6797  -0.6501  -0.5458  -0.5458   1.9888  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -1.44692    0.32084  -4.510 6.49e-06 ***
## DepartmentResearch & Development -0.38175    0.33417  -1.142    0.253    
## DepartmentSales                   0.09941    0.34152   0.291    0.771    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1298.6  on 1469  degrees of freedom
## Residual deviance: 1288.1  on 1467  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 1294.1
## 
## Number of Fisher Scoring iterations: 4

# No support

# H2: Job Role correlates with turnover.
H2 <- glm(attrition ~ JobRole, family=binomial(link='logit'), attrition)
summary(H2)

## 
## Call:
## glm(formula = attrition ~ JobRole, family = binomial(link = "logit"), 
##     data = attrition)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0068  -0.6200  -0.5925  -0.3170   2.7162  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -2.606796   0.345410  -7.547 4.45e-14 ***
## JobRoleHuman Resources         1.402824   0.477118   2.940 0.003280 ** 
## JobRoleLaboratory Technician   1.450727   0.374851   3.870 0.000109 ***
## JobRoleManager                -0.358477   0.574123  -0.624 0.532371    
## JobRoleManufacturing Director  0.004107   0.476146   0.009 0.993118    
## JobRoleResearch Director      -1.056765   0.795058  -1.329 0.183793    
## JobRoleResearch Scientist      0.955686   0.380350   2.513 0.011983 *  
## JobRoleSales Executive         1.055136   0.374926   2.814 0.004889 ** 
## JobRoleSales Representative    2.191281   0.411838   5.321 1.03e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1298.6  on 1469  degrees of freedom
## Residual deviance: 1209.7  on 1461  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 1227.7
## 
## Number of Fisher Scoring iterations: 6

# Support

# H3: Job Role and Job Satisfaction correlate with turnover.
H3 <- glm(attrition ~ JobRole + JobSatisfaction, family=binomial(link='logit'), attrition)
summary(H3)

## 
## Call:
## glm(formula = attrition ~ JobRole + JobSatisfaction, family = binomial(link = "logit"), 
##     data = attrition)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1928  -0.6419  -0.5030  -0.2993   2.7582  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -1.90778    0.38354  -4.974 6.56e-07 ***
## JobRoleHuman Resources         1.35890    0.47949   2.834 0.004596 ** 
## JobRoleLaboratory Technician   1.44047    0.37628   3.828 0.000129 ***
## JobRoleManager                -0.38246    0.57558  -0.664 0.506380    
## JobRoleManufacturing Director -0.01905    0.47751  -0.040 0.968170    
## JobRoleResearch Director      -1.08026    0.79620  -1.357 0.174853    
## JobRoleResearch Scientist      0.96168    0.38171   2.519 0.011756 *  
## JobRoleSales Executive         1.05496    0.37628   2.804 0.005052 ** 
## JobRoleSales Representative    2.20841    0.41395   5.335 9.56e-08 ***
## JobSatisfaction               -0.26441    0.06586  -4.015 5.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1298.6  on 1469  degrees of freedom
## Residual deviance: 1193.5  on 1460  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 1213.5
## 
## Number of Fisher Scoring iterations: 6

#compare models using a chi-square test.
anova(H2, H3, test="Chisq")

## Analysis of Deviance Table
## 
## Model 1: attrition ~ JobRole
## Model 2: attrition ~ JobRole + JobSatisfaction
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1      1461     1209.7                         
## 2      1460     1193.5  1    16.17 5.79e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Continue ad nausea

Looks like certain Job Roles and Job Satisfaction are uniquely associated with attrition. We could continue ad nausea until we’ve included all the variables that seem to predict attrition, but this would take a lot of time (and so is inefficient). Let’s try an inductive approach.

Inductive approach (anything we can measure)

Here, we throw everything into a regression model and see what predicts turnover. Once we identify what predicts turnover and what appears to be unrelated to turnover, we can trim the unnecessary variables from our model and report the findings. To avoid capitalizing on chance though, we need to split our data into a model building dataset (often called a training dataset) and a model testing dataset.

# Split data into Train/Validation/Test Sets
library(caTools)

## Warning: package 'caTools' was built under R version 3.5.3

set.seed(123) # Makes your work reproducible
sample <- sample.split(attrition,SplitRatio = 0.75) # splits the data in the ratio mentioned in SplitRatio. After splitting marks these rows as logical TRUE and the the remaining are marked as logical FALSE
train <- subset(attrition,sample ==TRUE) # creates a training dataset named train1 with rows which are marked as TRUE

## Warning: Length of logical index must be 1 or 1471, not 37

test <- subset(attrition, sample==FALSE)

## Warning: Length of logical index must be 1 or 1471, not 37

Having split our datasets, we’ll build a model that describes trends in our “train” dataset and then test this model on “test” dataset.

# Throw everything in and see what sticks
M1 <- glm(attrition ~ Age + BusinessTravel + DailyRate + Department + DistanceFromHome + Education + EducationField + EnvironmentSatisfaction + Gender + HourlyRate + JobInvolvement + JobLevel + JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome + MonthlyRate + NumCompaniesWorked + OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager, family=binomial(link='logit'), data = train)

# Summary of M1
summary(M1)

## 
## Call:
## glm(formula = attrition ~ Age + BusinessTravel + DailyRate + 
##     Department + DistanceFromHome + Education + EducationField + 
##     EnvironmentSatisfaction + Gender + HourlyRate + JobInvolvement + 
##     JobLevel + JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome + 
##     MonthlyRate + NumCompaniesWorked + OverTime + PercentSalaryHike + 
##     PerformanceRating + RelationshipSatisfaction + StockOptionLevel + 
##     TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance + 
##     YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + 
##     YearsWithCurrManager, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7815  -0.4673  -0.2275  -0.0797   3.4009  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)                      -1.185e+01  5.328e+02  -0.022 0.982259
## Age                              -3.613e-02  1.601e-02  -2.257 0.023999
## BusinessTravelTravel_Frequently   2.114e+00  4.922e-01   4.294 1.75e-05
## BusinessTravelTravel_Rarely       1.092e+00  4.483e-01   2.435 0.014904
## DailyRate                        -2.335e-04  2.682e-04  -0.871 0.383953
## DepartmentResearch & Development  1.344e+01  5.328e+02   0.025 0.979877
## DepartmentSales                   1.390e+01  5.328e+02   0.026 0.979190
## DistanceFromHome                  3.228e-02  1.338e-02   2.413 0.015825
## Education                         5.435e-02  1.074e-01   0.506 0.612699
## EducationFieldLife Sciences      -7.873e-01  9.603e-01  -0.820 0.412321
## EducationFieldMarketing          -3.272e-01  1.017e+00  -0.322 0.747728
## EducationFieldMedical            -1.061e+00  9.624e-01  -1.102 0.270404
## EducationFieldOther              -7.943e-01  1.026e+00  -0.774 0.438704
## EducationFieldTechnical Degree   -4.134e-02  9.972e-01  -0.041 0.966927
## EnvironmentSatisfaction          -4.734e-01  9.925e-02  -4.770 1.84e-06
## GenderMale                        7.190e-01  2.246e-01   3.201 0.001368
## HourlyRate                       -6.271e-04  5.330e-03  -0.118 0.906339
## JobInvolvement                   -5.972e-01  1.482e-01  -4.031 5.56e-05
## JobLevel                         -1.252e-01  3.857e-01  -0.325 0.745474
## JobRoleHuman Resources            1.553e+01  5.328e+02   0.029 0.976747
## JobRoleLaboratory Technician      2.361e+00  7.277e-01   3.245 0.001177
## JobRoleManager                    1.515e+00  1.079e+00   1.404 0.160201
## JobRoleManufacturing Director     1.310e+00  7.725e-01   1.696 0.089868
## JobRoleResearch Director         -6.364e-01  1.455e+00  -0.437 0.661761
## JobRoleResearch Scientist         1.513e+00  7.393e-01   2.046 0.040752
## JobRoleSales Executive            1.270e+00  1.288e+00   0.986 0.324263
## JobRoleSales Representative       2.475e+00  1.351e+00   1.832 0.066965
## JobSatisfaction                  -4.269e-01  9.971e-02  -4.281 1.86e-05
## MaritalStatusMarried              1.413e-01  3.156e-01   0.448 0.654474
## MaritalStatusSingle               1.020e+00  4.051e-01   2.519 0.011775
## MonthlyIncome                     3.475e-05  9.839e-05   0.353 0.723946
## MonthlyRate                       1.352e-05  1.505e-05   0.899 0.368838
## NumCompaniesWorked                2.134e-01  4.562e-02   4.678 2.90e-06
## OverTimeYes                       2.018e+00  2.338e-01   8.633  < 2e-16
## PercentSalaryHike                -6.217e-02  4.865e-02  -1.278 0.201207
## PerformanceRating                 3.579e-01  4.965e-01   0.721 0.471053
## RelationshipSatisfaction         -3.242e-01  1.001e-01  -3.238 0.001202
## StockOptionLevel                 -2.385e-01  1.867e-01  -1.277 0.201522
## TotalWorkingYears                -7.050e-02  3.689e-02  -1.911 0.055983
## TrainingTimesLastYear            -1.925e-01  8.630e-02  -2.230 0.025736
## WorkLifeBalance                  -3.140e-01  1.463e-01  -2.147 0.031806
## YearsAtCompany                    8.630e-02  4.883e-02   1.767 0.077185
## YearsInCurrentRole               -1.852e-01  5.570e-02  -3.325 0.000885
## YearsSinceLastPromotion           1.651e-01  5.087e-02   3.246 0.001171
## YearsWithCurrManager             -8.404e-02  5.994e-02  -1.402 0.160856
##                                     
## (Intercept)                         
## Age                              *  
## BusinessTravelTravel_Frequently  ***
## BusinessTravelTravel_Rarely      *  
## DailyRate                           
## DepartmentResearch & Development    
## DepartmentSales                     
## DistanceFromHome                 *  
## Education                           
## EducationFieldLife Sciences         
## EducationFieldMarketing             
## EducationFieldMedical               
## EducationFieldOther                 
## EducationFieldTechnical Degree      
## EnvironmentSatisfaction          ***
## GenderMale                       ** 
## HourlyRate                          
## JobInvolvement                   ***
## JobLevel                            
## JobRoleHuman Resources              
## JobRoleLaboratory Technician     ** 
## JobRoleManager                      
## JobRoleManufacturing Director    .  
## JobRoleResearch Director            
## JobRoleResearch Scientist        *  
## JobRoleSales Executive              
## JobRoleSales Representative      .  
## JobSatisfaction                  ***
## MaritalStatusMarried                
## MaritalStatusSingle              *  
## MonthlyIncome                       
## MonthlyRate                         
## NumCompaniesWorked               ***
## OverTimeYes                      ***
## PercentSalaryHike                   
## PerformanceRating                   
## RelationshipSatisfaction         ** 
## StockOptionLevel                    
## TotalWorkingYears                .  
## TrainingTimesLastYear            *  
## WorkLifeBalance                  *  
## YearsAtCompany                   .  
## YearsInCurrentRole               ***
## YearsSinceLastPromotion          ** 
## YearsWithCurrManager                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 963.85  on 1071  degrees of freedom
## Residual deviance: 612.18  on 1027  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 702.18
## 
## Number of Fisher Scoring iterations: 14

# Keep only the variables that matter
M2 <- glm(attrition ~  BusinessTravel + DistanceFromHome + EnvironmentSatisfaction + Gender + JobInvolvement + JobRole + JobSatisfaction + MaritalStatus + NumCompaniesWorked + OverTime + RelationshipSatisfaction + TotalWorkingYears + WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion, family=binomial(link='logit'), data = train)

# Summary of M2
summary(M2)

## 
## Call:
## glm(formula = attrition ~ BusinessTravel + DistanceFromHome + 
##     EnvironmentSatisfaction + Gender + JobInvolvement + JobRole + 
##     JobSatisfaction + MaritalStatus + NumCompaniesWorked + OverTime + 
##     RelationshipSatisfaction + TotalWorkingYears + WorkLifeBalance + 
##     YearsInCurrentRole + YearsSinceLastPromotion, family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7746  -0.4933  -0.2636  -0.0929   3.9179  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -0.95378    1.05872  -0.901 0.367649    
## BusinessTravelTravel_Frequently  2.16478    0.47731   4.535 5.75e-06 ***
## BusinessTravelTravel_Rarely      1.16372    0.43887   2.652 0.008010 ** 
## DistanceFromHome                 0.03258    0.01281   2.544 0.010949 *  
## EnvironmentSatisfaction         -0.42477    0.09369  -4.534 5.79e-06 ***
## GenderMale                       0.71965    0.21598   3.332 0.000862 ***
## JobInvolvement                  -0.61450    0.14185  -4.332 1.48e-05 ***
## JobRoleHuman Resources           2.66920    0.77535   3.443 0.000576 ***
## JobRoleLaboratory Technician     2.21627    0.66571   3.329 0.000871 ***
## JobRoleManager                   1.86802    0.82676   2.259 0.023856 *  
## JobRoleManufacturing Director    1.17577    0.75994   1.547 0.121817    
## JobRoleResearch Director        -0.16985    1.26577  -0.134 0.893256    
## JobRoleResearch Scientist        1.52910    0.67252   2.274 0.022986 *  
## JobRoleSales Executive           1.87098    0.65970   2.836 0.004567 ** 
## JobRoleSales Representative      2.96216    0.72076   4.110 3.96e-05 ***
## JobSatisfaction                 -0.41920    0.09494  -4.415 1.01e-05 ***
## MaritalStatusMarried             0.24098    0.29226   0.825 0.409624    
## MaritalStatusSingle              1.34343    0.29360   4.576 4.74e-06 ***
## NumCompaniesWorked               0.17654    0.04224   4.180 2.92e-05 ***
## OverTimeYes                      1.98867    0.22343   8.901  < 2e-16 ***
## RelationshipSatisfaction        -0.30319    0.09546  -3.176 0.001493 ** 
## TotalWorkingYears               -0.08239    0.02423  -3.400 0.000673 ***
## WorkLifeBalance                 -0.30374    0.13992  -2.171 0.029950 *  
## YearsInCurrentRole              -0.14933    0.04559  -3.276 0.001054 ** 
## YearsSinceLastPromotion          0.17792    0.04415   4.030 5.58e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 963.85  on 1071  degrees of freedom
## Residual deviance: 644.76  on 1047  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 694.76
## 
## Number of Fisher Scoring iterations: 7

Looks like several variables predict attrition (e.g., business travel, distance from home, environmental satisfaction, etc.).

Step 6: Evaluating the Models

Having built a model, we need to see how well it is performing.

# Evaluate M2
anova(M1,M2, test="Chisq")

## Analysis of Deviance Table
## 
## Model 1: attrition ~ Age + BusinessTravel + DailyRate + Department + DistanceFromHome + 
##     Education + EducationField + EnvironmentSatisfaction + Gender + 
##     HourlyRate + JobInvolvement + JobLevel + JobRole + JobSatisfaction + 
##     MaritalStatus + MonthlyIncome + MonthlyRate + NumCompaniesWorked + 
##     OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + 
##     StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + 
##     WorkLifeBalance + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + 
##     YearsWithCurrManager
## Model 2: attrition ~ BusinessTravel + DistanceFromHome + EnvironmentSatisfaction + 
##     Gender + JobInvolvement + JobRole + JobSatisfaction + MaritalStatus + 
##     NumCompaniesWorked + OverTime + RelationshipSatisfaction + 
##     TotalWorkingYears + WorkLifeBalance + YearsInCurrentRole + 
##     YearsSinceLastPromotion
##   Resid. Df Resid. Dev  Df Deviance Pr(>Chi)  
## 1      1027     612.18                        
## 2      1047     644.76 -20  -32.584  0.03746 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Report M2
table <- anova(M2, test = "Chisq")

# Get R2
library(pscl)

## Warning: package 'pscl' was built under R version 3.5.3

## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis

pR2(M2) # Model predicts roughly 33% (see McFadden R2) of the variance in turnover.

##          llh      llhNull           G2     McFadden         r2ML 
## -322.3798656 -481.9271633  319.0945954    0.3310610    0.2574484 
##         r2CU 
##    0.4340933

Note: the first model (M1) has relatively identical fit to the data as (M2). M2 accounts for about 33% of the variance in attrition and has an accuracy of rougly 90% when applied to new data. To get a sense of how well this model predicts attrition, consider that a model predicting lung cancer diagnoses using only smoking behavior accounts for less than 1% of all instances of lung cancer in a 25 year period (note: this still makes smoking the leading known cause of lung cancer). So we have a model that orders of magnitude more effective! This is great.

Here, we briefly evaluated the fitting of the model. Now we would like to see how the model works when predicting attrition on a new set of data. By setting the parameter type=‘response’, R will output probabilities in the form of P(y=1|X). Our decision boundary will be 0.5. If P(y=1|X) > 0.5 then y = 1 otherwise y=0. Note that for some applications different thresholds could be a better option.

# Fit model to test data and assess accuracy "out of sample"
fitted.results <- predict(M2,newdata=subset(test),type='response')
threshold <- .5
fitted.results <- ifelse(fitted.results > threshold,1,0)

# Guage how frequently (percentage) misclassification errors occur
misClasificError <- mean(fitted.results != test$attrition)
print(paste('Accuracy',1-misClasificError))

## [1] "Accuracy 0.881909547738693"

The 0.90 accuracy on the test set is quite a good result. We are able to predict with 90% accuracy whether or not someone is likely to leave. To better appreciate how the model works, lets build a confusion matrix to see this d ata.

# Conver fitted.results to a data frame
fitted.results <- as.data.frame(fitted.results)

# Activate the caret package
library(caret)

## Warning: package 'caret' was built under R version 3.5.3

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

# Build dataset and confusion matrix
cf.data <- cbind(test,fitted.results)

# Confusion Matrix
caret::confusionMatrix(as.factor(cf.data$fitted.results), as.factor(cf.data$attrition), positive = "1") # caret::confusionMatrix is selected because another package (InformationValue) also has a confusion matrix function, and it is not nearly as informative as the function provided by the caret package.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 327  35
##          1  12  24
##                                           
##                Accuracy : 0.8819          
##                  95% CI : (0.8461, 0.9119)
##     No Information Rate : 0.8518          
##     P-Value [Acc > NIR] : 0.049206        
##                                           
##                   Kappa : 0.4426          
##                                           
##  Mcnemar's Test P-Value : 0.001332        
##                                           
##             Sensitivity : 0.40678         
##             Specificity : 0.96460         
##          Pos Pred Value : 0.66667         
##          Neg Pred Value : 0.90331         
##              Prevalence : 0.14824         
##          Detection Rate : 0.06030         
##    Detection Prevalence : 0.09045         
##       Balanced Accuracy : 0.68569         
##                                           
##        'Positive' Class : 1               
##

It is important to understand accuracy can be misleading: ~90% sounds pretty good especially for modeling HR data, but if we just focus on instances when we are predicting someone is going to leave, a different message emerges [true positive / (true positive + false positive)]. Termed “precision”, this is [20 / (20+9) = 69%] 69%. In other words, we are correclty predicting that someone will leave 69% of the time. In this case, highly precise models accurately predict when someone is going to leave.

HR will want to know how accurately we can predict whether or not someone will leave. A metric called “recall”, which above is refered to as the “sensitivity” of the model, allows us to assess [true positive / (true positive + false negative); 20 / (20 + 27)]. Here, recall or sensitivity is 43%. In turnover terms, this tells us that of all instances of someone leaving, we are able to correctly predict these occurrences 43% of the time. Here, you can see that there are a lot of people that we miss.

Step 7: Interpret the Model

In this steps, we going to interpret the model. Specifically, we’re going to examine the odds of leaving or staying as a function of the predictors in the model.

# Odds ratios only coerced into table and round it to the nearest hundreths.
tbl1 <- as.data.frame(round(exp(coef(M2)),2))

# Relable Variable "exp(coef(M2))" as "Odds"
colnames(tbl1)[colnames(tbl1)=="round(exp(coef(M2)), 2)"] <-"Odds"

# Subset data into (i) drivers of turnover and (ii) drivers of retention.
d.turnover <- subset(tbl1, Odds >=1)
d.retention <- subset(tbl1, Odds <=1)

# Transform retention factor into odds favoring retention by dividing 1 by each individual odds ratio.
d.retention$Odds.R <- round(1/d.retention$Odds,2)

# Rename Odds.R
colnames(d.retention)[colnames(d.retention)=="Odds.r"] <-"Odds"

# Convert row Id to variable
library(data.table)

## Warning: package 'data.table' was built under R version 3.5.3

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:purrr':
## 
##     transpose

setDT(d.turnover, keep.rownames = TRUE)[] # Does this for turnover figures

##                                  rn  Odds
##  1: BusinessTravelTravel_Frequently  8.71
##  2:     BusinessTravelTravel_Rarely  3.20
##  3:                DistanceFromHome  1.03
##  4:                      GenderMale  2.05
##  5:          JobRoleHuman Resources 14.43
##  6:    JobRoleLaboratory Technician  9.17
##  7:                  JobRoleManager  6.48
##  8:   JobRoleManufacturing Director  3.24
##  9:       JobRoleResearch Scientist  4.61
## 10:          JobRoleSales Executive  6.49
## 11:     JobRoleSales Representative 19.34
## 12:            MaritalStatusMarried  1.27
## 13:             MaritalStatusSingle  3.83
## 14:              NumCompaniesWorked  1.19
## 15:                     OverTimeYes  7.31
## 16:         YearsSinceLastPromotion  1.19

setDT(d.retention, keep.rownames = TRUE)[] # Does this for retention figures

##                          rn Odds Odds.R
## 1:              (Intercept) 0.39   2.56
## 2:  EnvironmentSatisfaction 0.65   1.54
## 3:           JobInvolvement 0.54   1.85
## 4: JobRoleResearch Director 0.84   1.19
## 5:          JobSatisfaction 0.66   1.52
## 6: RelationshipSatisfaction 0.74   1.35
## 7:        TotalWorkingYears 0.92   1.09
## 8:          WorkLifeBalance 0.74   1.35
## 9:       YearsInCurrentRole 0.86   1.16

The first table prints the odds of leaving. For instance, if someone travels frequently (relative to not traveling frequently), controlling for all other influences one is 5.64 times more likely to leave.

The second table prints the odds of staying (see “Odds.R”). Here, for instance, someone who works in the Research Director role (compared to not working in this role), controlling for all other influencdes one is 4.55 times more likely to leave.

These are perfect data to plot.

Step 8: Build visualizations and data products

I’ve chosen a bar graph to do this (see function “geom_bar()” below).

# Re-order data
d.turnover$rn <- factor(d.turnover$rn, levels = d.turnover$rn[order(d.turnover$Odds)])
d.turnover$rn

##  [1] BusinessTravelTravel_Frequently BusinessTravelTravel_Rarely    
##  [3] DistanceFromHome                GenderMale                     
##  [5] JobRoleHuman Resources          JobRoleLaboratory Technician   
##  [7] JobRoleManager                  JobRoleManufacturing Director  
##  [9] JobRoleResearch Scientist       JobRoleSales Executive         
## [11] JobRoleSales Representative     MaritalStatusMarried           
## [13] MaritalStatusSingle             NumCompaniesWorked             
## [15] OverTimeYes                     YearsSinceLastPromotion        
## 16 Levels: DistanceFromHome NumCompaniesWorked ... JobRoleSales Representative

# Rename risk factors
colnames(d.turnover)[colnames(d.turnover)=="rn"] <-"Factor"

# Plot
ggplot(d.turnover, aes(x = Factor, y = Odds)) + geom_bar(stat = "identity", fill = "#FF9999") + ggtitle("Working overtime makes you 7.4 times more likely to leave") + coord_flip() + geom_text(aes(label=Odds),position=position_dodge(0.9),hjust=-.05) + theme(axis.text.x=element_blank())+ labs(x = element_blank(), y = "Odds")

Notice that this plot has protected class information in it (gender, marital status). Let’s remove that before we share with management.

# Remove protected classes
d.turnover <- d.turnover[-c(1,5,12,13),]

# Rebuild plot
ggplot(d.turnover, aes(x = Factor, y = Odds)) + geom_bar(stat = "identity", fill = "#FF9999") + ggtitle("Working overtime makes you 7.4 times more likely to leave") + coord_flip() + geom_text(aes(label=Odds),position=position_dodge(0.9),hjust=-.05) + theme(axis.text.x=element_blank())+ labs(x = element_blank(), y = "Odds")

# Save ggplot
ggsave("drivers of turnover.png")

## Saving 7 x 5 in image

Now lets visualize drivers of retention.

# Re-order data
d.retention$rn <- factor(d.retention$rn, levels = d.retention$rn[order(d.retention$Odds.R)])
d.retention$rn

## [1] (Intercept)              EnvironmentSatisfaction 
## [3] JobInvolvement           JobRoleResearch Director
## [5] JobSatisfaction          RelationshipSatisfaction
## [7] TotalWorkingYears        WorkLifeBalance         
## [9] YearsInCurrentRole      
## 9 Levels: TotalWorkingYears ... (Intercept)

# Rename risk factors
colnames(d.retention)[colnames(d.retention)=="rn"] <-"Factor"

# Plot
ggplot(d.retention, aes(x = Factor, y = Odds.R)) + geom_bar(stat = "identity", fill = "#66CC99") + ggtitle("Having work-life balance makes you 1.39 times more likely to stay") + coord_flip() + geom_text(aes(label=Odds.R),position=position_dodge(0.9),hjust=-.05) + theme(axis.text.x=element_blank())+ labs(x = element_blank(), y = "Odds")

# Save ggplot
ggsave("drivers of retention.png")

## Saving 7 x 5 in image

Step 9: Pilot design and testing

At this stage, you’d probably conduct a focus group with employees and managers to gather feedback on the results. They will probably say that these results make sense and fit with their intuition (some may disagree, and you need to probe further why that is the case).

Their feedback, along with your independent research will inform next steps regarding making changes in the organization. For instance, one promising intervention seems to be incorporating a staffing optimization tool that will prepdict when someone will leave so that you do not have to burden the workers you have with overtime. In fact, probably anything doing with workforce planning would be helpful for preventing overtime. However,you might need to consider pilot testing intervention before rolling it out to the whole company.

Step 9: Design and implement

In this step, you’ve conducted a pilot study or two and found out how you might reduce turnover (e.g., by reducing overtime, making the work more pleasant, etc.). Great! Roll this out to the rest of the company and congratulate yourself! You’ve helped make work not suck!