getwd()

## [1] "C:/Users/joncd/Documents"

The following analysis represents an approach to analyzing employee attrition via survival analysis. We will be using a publicly available simulated dataset from IBM which can be downloaded from Kaggle here IBM Data.

Additional helpful links can be found at:

*Cox PH Test in R

*Assumption Testing

*Survival Analysis Basics

*In Depth Discussion of Survival Analysis in R

*More Survival Analysis in R

We will begin by importing our library, importing our data, and conducting some simple data transformations to make the anlaysis easier on ourselves.

library(readr)
library(tidyverse)
library(survival)
library(survminer)
WA_Fn_UseC_HR_Employee_Attrition <- read_csv("C:/Users/joncd/Downloads/WA_Fn-UseC_-HR-Employee-Attrition.csv")
data <- as_tibble(WA_Fn_UseC_HR_Employee_Attrition)
set.seed(150) # For Reproducibility
# Simple data cleaning
workset <- data %>% 
  mutate(BusinessTravel = as.factor(BusinessTravel),
         Department = as.factor(Department),
         Education = as.factor(Education),
         EducationField = as.factor(EducationField),
         Gender = as.factor(Gender),
         JobLevel = as.factor(JobLevel),
         JobRole = as.factor(JobRole),
         MaritalStatus = as.factor(MaritalStatus),
         OverTime = as.factor(OverTime),
         Over18 = as.factor(Over18),
         StockOptionLevel = as.factor(StockOptionLevel),
         Attrition = ifelse(Attrition == "No",0,1))

Let’s take a quick look at our dataset

glimpse(workset)

## Rows: 1,470
## Columns: 35
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2~
## $ Attrition                <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ~
## $ BusinessTravel           <fct> Travel_Rarely, Travel_Frequently, Travel_Rare~
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,~
## $ Department               <fct> Sales, Research & Development, Research & Dev~
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, ~
## $ Education                <fct> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, ~
## $ EducationField           <fct> Life Sciences, Life Sciences, Other, Life Sci~
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,~
## $ EnvironmentSatisfaction  <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, ~
## $ Gender                   <fct> Female, Male, Male, Female, Male, Male, Femal~
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4~
## $ JobInvolvement           <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, ~
## $ JobLevel                 <fct> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, ~
## $ JobRole                  <fct> Sales Executive, Research Scientist, Laborato~
## $ JobSatisfaction          <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, ~
## $ MaritalStatus            <fct> Single, Married, Single, Married, Married, Si~
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269~
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964~
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, ~
## $ Over18                   <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ~
## $ OverTime                 <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1~
## $ PerformanceRating        <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, ~
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, ~
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8~
## $ StockOptionLevel         <fct> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, ~
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3~
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, ~
## $ WorkLifeBalance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, ~
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,~
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, ~
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, ~
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, ~

Next, we’ll perform a standard 80/20 split of the data for model validation later on.

dt = sort(sample(nrow(workset), nrow(workset)*.8))
train <- workset[dt,]
test <- workset[-dt,]

Kaplan-Meirer Survival Curve

surv.obj <- Surv(train$YearsAtCompany, train$Attrition)
surv.fit <- survfit(surv.obj ~ 1)
summary(surv.fit)

## Call: survfit(formula = surv.obj ~ 1)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0   1176      14    0.988 0.00316        0.982        0.994
##     1   1137      51    0.944 0.00678        0.931        0.957
##     2   1001      20    0.925 0.00784        0.910        0.940
##     3    893      15    0.909 0.00868        0.893        0.927
##     4    789      17    0.890 0.00970        0.871        0.909
##     5    695      19    0.865 0.01093        0.844        0.887
##     6    539       7    0.854 0.01158        0.832        0.877
##     7    489       9    0.839 0.01250        0.814        0.863
##     8    414       7    0.824 0.01339        0.799        0.851
##     9    357       5    0.813 0.01416        0.785        0.841
##    10    295      14    0.774 0.01683        0.742        0.808
##    11    196       1    0.770 0.01720        0.737        0.805
##    13    161       2    0.761 0.01827        0.726        0.797
##    15    128       1    0.755 0.01907        0.718        0.793
##    17    102       1    0.747 0.02026        0.709        0.788
##    18     95       1    0.739 0.02152        0.698        0.783
##    19     84       1    0.731 0.02300        0.687        0.777
##    20     74       1    0.721 0.02472        0.674        0.771
##    21     55       1    0.708 0.02752        0.656        0.764
##    22     45       1    0.692 0.03108        0.634        0.756
##    24     32       1    0.670 0.03687        0.602        0.747
##    31     14       1    0.622 0.05746        0.519        0.746
##    32     12       1    0.571 0.07239        0.445        0.732
##    33      9       1    0.507 0.08783        0.361        0.712
##    40      1       1    0.000     NaN           NA           NA

ggsurvplot(surv.fit,data = train, color = "#2E9FDF",
           ggtheme = theme_minimal())

From this we see an overall survival slope for the entire population of company employees. Not super helpful for making acctionable decisions on its own but does give us some baseline idea of the employee lifecycle. For instance, half of all hires are still with the company after 30 years.

We can also begin to segment data here to compare separate conditons between each other. For example we can look at the difference between levels of our JobRole variable.

surv_job_role <- survfit(Surv(YearsAtCompany, Attrition) ~ JobRole, data = train)
ggsurv<-ggsurvplot(surv_job_role,
           pval = FALSE, # add p-values
           conf.int = FALSE, # if we wanted confident intervals, maybe not with this many levels of the JobRole variable
           risk.table = FALSE, # Add risk table
           risk.table.col = "strata", # Change risk table color by groups
           linetype = "strata", # Change line type by groups
           ggtheme = theme_bw()) # Change ggplot2 theme

ggsurv$plot +theme_bw() + 
  theme (legend.position = "right")

By gender,

surv_job_role <- survfit(Surv(YearsAtCompany, Attrition) ~ Gender, data = train)
ggsurvplot(surv_job_role,
           pval = TRUE, conf.int = TRUE,
           risk.table = TRUE,
           risk.table.col = "strata",
           linetype = "strata", 
           ggtheme = theme_bw())

By Department,

ggsurvplot(survfit(Surv(YearsAtCompany, Attrition) ~ Department, data = train), data = train, conf.int = FALSE)

and Stock Options,

surv_job_role <- survfit(Surv(YearsAtCompany, Attrition) ~ StockOptionLevel, data = train)
ggsurv2<-ggsurvplot(surv_job_role,
           pval = TRUE, conf.int = FALSE,
           risk.table = FALSE, # Add risk table
           risk.table.col = "strata", # Change risk table color by groups
           linetype = "strata", # Change line type by groups
           ggtheme = theme_bw()) # Change ggplot2 theme

ggsurv2$plot +theme_bw() + 
  theme (legend.position = "right")

While visuals are helpuful, statistical analysis can help decipher more complex features.

summary(train$JobRole)

## Healthcare Representative           Human Resources     Laboratory Technician 
##                       102                        38                       202 
##                   Manager    Manufacturing Director         Research Director 
##                        81                       119                        63 
##        Research Scientist           Sales Executive      Sales Representative 
##                       240                       268                        63

survdiff(Surv(YearsAtCompany, Attrition) ~ JobRole, data = train) # Tells us if differences exist between groups (based on logrank test of survival curves)

## Call:
## survdiff(formula = Surv(YearsAtCompany, Attrition) ~ JobRole, 
##     data = train)
## 
##                                     N Observed Expected (O-E)^2/E (O-E)^2/V
## JobRole=Healthcare Representative 102        7    20.19     8.615     10.36
## JobRole=Human Resources            38       10     5.19     4.460      4.72
## JobRole=Laboratory Technician     202       48    26.97    16.390     19.79
## JobRole=Manager                    81        5    21.84    12.987     16.45
## JobRole=Manufacturing Director    119        8    20.93     7.991      9.22
## JobRole=Research Director          63        2    13.99    10.274     11.51
## JobRole=Research Scientist        240       44    31.55     4.909      6.15
## JobRole=Sales Executive           268       43    46.79     0.307      0.42
## JobRole=Sales Representative       63       26     5.54    75.517     81.08
## 
##  Chisq= 156  on 8 degrees of freedom, p= <2e-16

train$JobRole <- relevel(train$JobRole, ref = "Sales Executive") # Because this is our largest group, we set this as our "referent" group for comparison in the subsequent code
cox.model.jobrole <-  coxph(formula = surv.obj ~ JobRole, data = train) # why we needed to relevel()
cox.zph(cox.model.jobrole) # Assumption Test

##         chisq df     p
## JobRole  14.2  8 0.076
## GLOBAL   14.2  8 0.076

summary(cox.model.jobrole) # Tells us which groups are different and by how much

## Call:
## coxph(formula = surv.obj ~ JobRole, data = train)
## 
##   n= 1176, number of events= 193 
## 
##                                     coef exp(coef) se(coef)      z Pr(>|z|)    
## JobRoleHealthcare Representative -1.1839    0.3061   0.4387 -2.699 0.006955 ** 
## JobRoleHuman Resources            0.8382    2.3122   0.3522  2.380 0.017332 *  
## JobRoleLaboratory Technician      0.7363    2.0881   0.2112  3.486 0.000490 ***
## JobRoleManager                   -1.7807    0.1685   0.4956 -3.593 0.000327 ***
## JobRoleManufacturing Director    -0.9083    0.4032   0.3854 -2.357 0.018438 *  
## JobRoleResearch Director         -2.0654    0.1268   0.7284 -2.836 0.004574 ** 
## JobRoleResearch Scientist         0.5127    1.6697   0.2165  2.368 0.017903 *  
## JobRoleSales Representative       1.8477    6.3453   0.2553  7.236 4.61e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                                  exp(coef) exp(-coef) lower .95 upper .95
## JobRoleHealthcare Representative    0.3061     3.2672   0.12955    0.7231
## JobRoleHuman Resources              2.3122     0.4325   1.15930    4.6117
## JobRoleLaboratory Technician        2.0881     0.4789   1.38034    3.1589
## JobRoleManager                      0.1685     5.9341   0.06379    0.4452
## JobRoleManufacturing Director       0.4032     2.4800   0.18945    0.8582
## JobRoleResearch Director            0.1268     7.8884   0.03041    0.5285
## JobRoleResearch Scientist           1.6697     0.5989   1.09228    2.5524
## JobRoleSales Representative         6.3453     0.1576   3.84683   10.4664
## 
## Concordance= 0.742  (se = 0.018 )
## Likelihood ratio test= 136.2  on 8 df,   p=<2e-16
## Wald test            = 116  on 8 df,   p=<2e-16
## Score (logrank) test = 159.4  on 8 df,   p=<2e-16

fit<- survfit(Surv(YearsAtCompany,Attrition)~JobRole, data = train)
summary(fit) # Provides a detailed look at survival curves by group

## Call: survfit(formula = Surv(YearsAtCompany, Attrition) ~ JobRole, 
##     data = train)
## 
##                 JobRole=Sales Executive 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0    268       3    0.989 0.00643        0.976        1.000
##     1    261       2    0.981 0.00832        0.965        0.998
##     2    250       6    0.958 0.01249        0.934        0.982
##     3    230       3    0.945 0.01426        0.918        0.974
##     4    212       5    0.923 0.01706        0.890        0.957
##     5    193       5    0.899 0.01968        0.861        0.938
##     6    148       3    0.881 0.02192        0.839        0.925
##     7    135       2    0.868 0.02345        0.823        0.915
##     8    113       4    0.837 0.02719        0.785        0.892
##     9     92       1    0.828 0.02838        0.774        0.885
##    10     76       4    0.784 0.03424        0.720        0.854
##    11     43       1    0.766 0.03799        0.695        0.844
##    13     33       1    0.743 0.04336        0.663        0.833
##    15     24       1    0.712 0.05143        0.618        0.820
##    19     13       1    0.657 0.07086        0.532        0.812
##    21      7       1    0.563 0.10604        0.389        0.815
## 
##                 JobRole=Healthcare Representative 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1    100       3    0.970  0.0171        0.937        1.000
##     9     41       1    0.946  0.0287        0.892        1.000
##    10     35       1    0.919  0.0386        0.847        0.998
##    18     11       1    0.836  0.0871        0.681        1.000
##    40      1       1    0.000     NaN           NA           NA
## 
##                 JobRole=Human Resources 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1     38       4    0.895  0.0498        0.802        0.998
##     2     34       2    0.842  0.0592        0.734        0.966
##     3     26       1    0.810  0.0651        0.692        0.948
##     5     19       1    0.767  0.0744        0.634        0.928
##     7     11       1    0.697  0.0948        0.534        0.910
##    20      1       1    0.000     NaN           NA           NA
## 
##                 JobRole=Laboratory Technician 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0    202       5    0.975  0.0109        0.954        0.997
##     1    193      18    0.884  0.0227        0.841        0.930
##     2    155       4    0.861  0.0248        0.814        0.911
##     3    135       3    0.842  0.0266        0.792        0.896
##     4    113       6    0.798  0.0308        0.739        0.860
##     5     94       4    0.764  0.0339        0.700        0.833
##     6     68       1    0.752  0.0352        0.687        0.825
##     7     62       3    0.716  0.0393        0.643        0.797
##     9     43       2    0.683  0.0439        0.602        0.774
##    10     31       1    0.661  0.0477        0.573        0.761
##    17      7       1    0.566  0.0965        0.406        0.791
## 
##                 JobRole=Manager 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     5     66       1    0.985  0.0150        0.956            1
##     7     57       1    0.968  0.0226        0.924            1
##    10     48       1    0.947  0.0298        0.891            1
##    24     18       1    0.895  0.0584        0.787            1
##    32      7       1    0.767  0.1285        0.552            1
## 
##                 JobRole=Manufacturing Director 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1    117       2    0.983  0.0120        0.960        1.000
##     3    102       1    0.973  0.0153        0.944        1.000
##     4     90       1    0.962  0.0185        0.927        0.999
##    10     35       3    0.880  0.0486        0.790        0.981
##    33      1       1    0.000     NaN           NA           NA
## 
##                 JobRole=Research Director 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##    22      9       1    0.889   0.105        0.706            1
##    31      3       1    0.593   0.252        0.258            1
## 
##                 JobRole=Research Scientist 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0    240       2    0.992 0.00587        0.980        1.000
##     1    231      11    0.944 0.01498        0.916        0.974
##     2    192       6    0.915 0.01874        0.879        0.952
##     3    164       2    0.904 0.02010        0.865        0.944
##     4    142       3    0.885 0.02250        0.842        0.930
##     5    117       6    0.839 0.02795        0.786        0.896
##     6     78       3    0.807 0.03250        0.746        0.873
##     7     61       2    0.781 0.03642        0.712        0.855
##     8     50       3    0.734 0.04312        0.654        0.823
##     9     39       1    0.715 0.04594        0.630        0.811
##    10     32       4    0.626 0.05799        0.522        0.750
##    13     12       1    0.573 0.07291        0.447        0.736
## 
##                 JobRole=Sales Representative 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     0     63       4    0.937  0.0307        0.878        0.999
##     1     56      11    0.753  0.0555        0.651        0.870
##     2     40       2    0.715  0.0588        0.609        0.840
##     3     26       5    0.577  0.0728        0.451        0.739
##     4     16       2    0.505  0.0796        0.371        0.688
##     5     11       2    0.413  0.0877        0.273        0.627

Cox Proportional Hazards Model

Our next step would be to understand the impact of additional covariates on attrition liklihood. For that, we’ll use the Cox Proportional Hazards model to model a survival curve, but with the logic of a regression.

The following code represents all variables in our data thrown into a model at once.

cox.model.full <- coxph(formula = surv.obj ~ Age + BusinessTravel + DailyRate + Department + DistanceFromHome + Education + EducationField +  EnvironmentSatisfaction + Gender + HourlyRate + JobInvolvement + JobLevel +
                          JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome + MonthlyRate + NumCompaniesWorked + OverTime +
                          PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + StockOptionLevel + TotalWorkingYears +
                          TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager, data = train)

## Warning in fitter(X, Y, istrat, offset, init, control, weights = weights, :
## Loglik converged before variable 5,6,26 ; coefficient may be infinite.

As we might expect, based on the error message we recieved, there may be something fishy going on with our data. Ideally, we should test some additonal assumptions before exploring the model, but we will test those shortly.

Keep in mind that in an in an exponential model (which this is), anything greater than e+01 is really large.

summary(cox.model.full)

## Call:
## coxph(formula = surv.obj ~ Age + BusinessTravel + DailyRate + 
##     Department + DistanceFromHome + Education + EducationField + 
##     EnvironmentSatisfaction + Gender + HourlyRate + JobInvolvement + 
##     JobLevel + JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome + 
##     MonthlyRate + NumCompaniesWorked + OverTime + PercentSalaryHike + 
##     PerformanceRating + RelationshipSatisfaction + StockOptionLevel + 
##     TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance + 
##     YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager, 
##     data = train)
## 
##   n= 1176, number of events= 193 
## 
##                                        coef  exp(coef)   se(coef)      z
## Age                              -2.431e-02  9.760e-01  1.269e-02 -1.915
## BusinessTravelTravel_Frequently   1.636e+00  5.134e+00  3.920e-01  4.173
## BusinessTravelTravel_Rarely       8.958e-01  2.449e+00  3.688e-01  2.429
## DailyRate                        -6.955e-05  9.999e-01  1.955e-04 -0.356
## DepartmentResearch & Development  1.681e+01  1.999e+07  1.557e+03  0.011
## DepartmentSales                   1.629e+01  1.193e+07  1.557e+03  0.010
## DistanceFromHome                  2.203e-02  1.022e+00  9.422e-03  2.338
## Education2                        4.348e-01  1.545e+00  2.989e-01  1.454
## Education3                        5.586e-01  1.748e+00  2.604e-01  2.145
## Education4                        3.107e-01  1.364e+00  2.882e-01  1.078
## Education5                        4.037e-01  1.497e+00  6.296e-01  0.641
## EducationFieldLife Sciences       4.336e-01  1.543e+00  7.210e-01  0.601
## EducationFieldMarketing           1.023e+00  2.782e+00  7.586e-01  1.349
## EducationFieldMedical             2.421e-01  1.274e+00  7.202e-01  0.336
## EducationFieldOther               4.905e-01  1.633e+00  8.194e-01  0.599
## EducationFieldTechnical Degree    7.794e-01  2.180e+00  7.363e-01  1.058
## EnvironmentSatisfaction          -3.345e-01  7.157e-01  6.924e-02 -4.831
## GenderMale                        4.216e-01  1.524e+00  1.657e-01  2.544
## HourlyRate                        7.283e-03  1.007e+00  4.228e-03  1.722
## JobInvolvement                   -2.508e-01  7.782e-01  1.074e-01 -2.335
## JobLevel2                        -1.147e+00  3.175e-01  4.568e-01 -2.511
## JobLevel3                        -2.535e-01  7.761e-01  6.962e-01 -0.364
## JobLevel4                        -9.918e-01  3.709e-01  1.218e+00 -0.815
## JobLevel5                         1.252e+00  3.499e+00  1.667e+00  0.751
## JobRoleHealthcare Representative -1.471e+00  2.298e-01  1.175e+00 -1.252
## JobRoleHuman Resources            1.616e+01  1.043e+07  1.557e+03  0.010
## JobRoleLaboratory Technician     -5.782e-01  5.609e-01  1.146e+00 -0.504
## JobRoleManager                   -1.596e+00  2.028e-01  1.306e+00 -1.222
## JobRoleManufacturing Director    -7.715e-01  4.623e-01  1.133e+00 -0.681
## JobRoleResearch Director         -2.774e+00  6.238e-02  1.599e+00 -1.735
## JobRoleResearch Scientist        -1.574e+00  2.072e-01  1.158e+00 -1.360
## JobRoleSales Representative       5.740e-02  1.059e+00  4.985e-01  0.115
## JobSatisfaction                  -3.893e-01  6.776e-01  7.274e-02 -5.351
## MaritalStatusMarried             -6.191e-02  9.400e-01  2.557e-01 -0.242
## MaritalStatusSingle               2.179e-03  1.002e+00  3.519e-01  0.006
## MonthlyIncome                    -3.774e-05  1.000e+00  8.498e-05 -0.444
## MonthlyRate                       2.428e-05  1.000e+00  1.137e-05  2.135
## NumCompaniesWorked                2.437e-01  1.276e+00  3.345e-02  7.286
## OverTimeYes                       1.332e+00  3.789e+00  1.633e-01  8.159
## PercentSalaryHike                 1.392e-02  1.014e+00  3.534e-02  0.394
## PerformanceRating                 1.086e-01  1.115e+00  3.463e-01  0.314
## RelationshipSatisfaction         -1.886e-01  8.282e-01  7.592e-02 -2.484
## StockOptionLevel1                -1.043e+00  3.525e-01  2.824e-01 -3.693
## StockOptionLevel2                -1.235e+00  2.909e-01  4.236e-01 -2.915
## StockOptionLevel3                -6.708e-01  5.113e-01  3.932e-01 -1.706
## TotalWorkingYears                -1.371e-01  8.719e-01  3.449e-02 -3.975
## TrainingTimesLastYear            -2.243e-01  7.991e-01  6.998e-02 -3.205
## WorkLifeBalance                  -1.502e-01  8.606e-01  1.083e-01 -1.387
## YearsInCurrentRole               -4.014e-01  6.694e-01  4.385e-02 -9.154
## YearsSinceLastPromotion           3.963e-02  1.040e+00  4.021e-02  0.986
## YearsWithCurrManager             -3.696e-01  6.910e-01  4.142e-02 -8.924
##                                  Pr(>|z|)    
## Age                              0.055505 .  
## BusinessTravelTravel_Frequently  3.00e-05 ***
## BusinessTravelTravel_Rarely      0.015161 *  
## DailyRate                        0.721984    
## DepartmentResearch & Development 0.991387    
## DepartmentSales                  0.991652    
## DistanceFromHome                 0.019378 *  
## Education2                       0.145845    
## Education3                       0.031938 *  
## Education4                       0.281007    
## Education5                       0.521433    
## EducationFieldLife Sciences      0.547557    
## EducationFieldMarketing          0.177339    
## EducationFieldMedical            0.736770    
## EducationFieldOther              0.549421    
## EducationFieldTechnical Degree   0.289858    
## EnvironmentSatisfaction          1.36e-06 ***
## GenderMale                       0.010964 *  
## HourlyRate                       0.084982 .  
## JobInvolvement                   0.019565 *  
## JobLevel2                        0.012026 *  
## JobLevel3                        0.715749    
## JobLevel4                        0.415348    
## JobLevel5                        0.452623    
## JobRoleHealthcare Representative 0.210726    
## JobRoleHuman Resources           0.991721    
## JobRoleLaboratory Technician     0.614028    
## JobRoleManager                   0.221818    
## JobRoleManufacturing Director    0.495988    
## JobRoleResearch Director         0.082775 .  
## JobRoleResearch Scientist        0.173951    
## JobRoleSales Representative      0.908331    
## JobSatisfaction                  8.73e-08 ***
## MaritalStatusMarried             0.808717    
## MaritalStatusSingle              0.995060    
## MonthlyIncome                    0.656916    
## MonthlyRate                      0.032779 *  
## NumCompaniesWorked               3.19e-13 ***
## OverTimeYes                      3.37e-16 ***
## PercentSalaryHike                0.693721    
## PerformanceRating                0.753784    
## RelationshipSatisfaction         0.013000 *  
## StockOptionLevel1                0.000222 ***
## StockOptionLevel2                0.003560 ** 
## StockOptionLevel3                0.087965 .  
## TotalWorkingYears                7.03e-05 ***
## TrainingTimesLastYear            0.001351 ** 
## WorkLifeBalance                  0.165450    
## YearsInCurrentRole                < 2e-16 ***
## YearsSinceLastPromotion          0.324306    
## YearsWithCurrManager              < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                                  exp(coef) exp(-coef) lower .95 upper .95
## Age                              9.760e-01  1.025e+00  0.952003    1.0006
## BusinessTravelTravel_Frequently  5.134e+00  1.948e-01  2.381197   11.0689
## BusinessTravelTravel_Rarely      2.449e+00  4.083e-01  1.188659    5.0464
## DailyRate                        9.999e-01  1.000e+00  0.999547    1.0003
## DepartmentResearch & Development 1.999e+07  5.002e-08  0.000000       Inf
## DepartmentSales                  1.193e+07  8.384e-08  0.000000       Inf
## DistanceFromHome                 1.022e+00  9.782e-01  1.003570    1.0413
## Education2                       1.545e+00  6.474e-01  0.859726    2.7750
## Education3                       1.748e+00  5.720e-01  1.049410    2.9122
## Education4                       1.364e+00  7.330e-01  0.775585    2.4000
## Education5                       1.497e+00  6.679e-01  0.435904    5.1431
## EducationFieldLife Sciences      1.543e+00  6.481e-01  0.375479    6.3398
## EducationFieldMarketing          2.782e+00  3.594e-01  0.629098   12.3061
## EducationFieldMedical            1.274e+00  7.850e-01  0.310538    5.2258
## EducationFieldOther              1.633e+00  6.123e-01  0.327740    8.1386
## EducationFieldTechnical Degree   2.180e+00  4.587e-01  0.514877    9.2308
## EnvironmentSatisfaction          7.157e-01  1.397e+00  0.624855    0.8197
## GenderMale                       1.524e+00  6.560e-01  1.101600    2.1094
## HourlyRate                       1.007e+00  9.927e-01  0.998996    1.0157
## JobInvolvement                   7.782e-01  1.285e+00  0.630484    0.9606
## JobLevel2                        3.175e-01  3.150e+00  0.129680    0.7773
## JobLevel3                        7.761e-01  1.289e+00  0.198307    3.0372
## JobLevel4                        3.709e-01  2.696e+00  0.034105    4.0339
## JobLevel5                        3.499e+00  2.858e-01  0.133208   91.8880
## JobRoleHealthcare Representative 2.298e-01  4.352e+00  0.022967    2.2988
## JobRoleHuman Resources           1.043e+07  9.587e-08  0.000000       Inf
## JobRoleLaboratory Technician     5.609e-01  1.783e+00  0.059296    5.3059
## JobRoleManager                   2.028e-01  4.932e+00  0.015674    2.6229
## JobRoleManufacturing Director    4.623e-01  2.163e+00  0.050164    4.2610
## JobRoleResearch Director         6.238e-02  1.603e+01  0.002715    1.4335
## JobRoleResearch Scientist        2.072e-01  4.827e+00  0.021417    2.0039
## JobRoleSales Representative      1.059e+00  9.442e-01  0.398632    2.8138
## JobSatisfaction                  6.776e-01  1.476e+00  0.587532    0.7814
## MaritalStatusMarried             9.400e-01  1.064e+00  0.569430    1.5516
## MaritalStatusSingle              1.002e+00  9.978e-01  0.502801    1.9975
## MonthlyIncome                    1.000e+00  1.000e+00  0.999796    1.0001
## MonthlyRate                      1.000e+00  1.000e+00  1.000002    1.0000
## NumCompaniesWorked               1.276e+00  7.837e-01  1.195010    1.3624
## OverTimeYes                      3.789e+00  2.639e-01  2.751392    5.2178
## PercentSalaryHike                1.014e+00  9.862e-01  0.946161    1.0867
## PerformanceRating                1.115e+00  8.971e-01  0.565484    2.1974
## RelationshipSatisfaction         8.282e-01  1.208e+00  0.713656    0.9610
## StockOptionLevel1                3.525e-01  2.837e+00  0.202677    0.6130
## StockOptionLevel2                2.909e-01  3.437e+00  0.126820    0.6673
## StockOptionLevel3                5.113e-01  1.956e+00  0.236604    1.1049
## TotalWorkingYears                8.719e-01  1.147e+00  0.814867    0.9328
## TrainingTimesLastYear            7.991e-01  1.251e+00  0.696675    0.9166
## WorkLifeBalance                  8.606e-01  1.162e+00  0.696017    1.0640
## YearsInCurrentRole               6.694e-01  1.494e+00  0.614253    0.7295
## YearsSinceLastPromotion          1.040e+00  9.611e-01  0.961579    1.1257
## YearsWithCurrManager             6.910e-01  1.447e+00  0.637107    0.7494
## 
## Concordance= 0.958  (se = 0.006 )
## Likelihood ratio test= 739.7  on 51 df,   p=<2e-16
## Wald test            = 399.9  on 51 df,   p=<2e-16
## Score (logrank) test = 615.9  on 51 df,   p=<2e-16

Looking at the first column (coef) we see rather large coefficents for some Department and JobRole variables. If we look back at the graphs of these variables, we see that their respective curves, cross. This is likely indicative of a key violation of the proportional hazards assumption.

Our next step then will be to try a more limited model which excludes those variables.

cox.model.simple <- coxph(formula = surv.obj ~ Age + BusinessTravel + DailyRate + DistanceFromHome + Education + EducationField +  EnvironmentSatisfaction + Gender + HourlyRate + JobInvolvement + JobLevel +
                             JobSatisfaction + MaritalStatus + MonthlyIncome + MonthlyRate + NumCompaniesWorked + OverTime +
                            PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + StockOptionLevel + TotalWorkingYears +
                            TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager, data = train)

summary(cox.model.simple)

## Call:
## coxph(formula = surv.obj ~ Age + BusinessTravel + DailyRate + 
##     DistanceFromHome + Education + EducationField + EnvironmentSatisfaction + 
##     Gender + HourlyRate + JobInvolvement + JobLevel + JobSatisfaction + 
##     MaritalStatus + MonthlyIncome + MonthlyRate + NumCompaniesWorked + 
##     OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + 
##     StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + 
##     WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + 
##     YearsWithCurrManager, data = train)
## 
##   n= 1176, number of events= 193 
## 
##                                       coef  exp(coef)   se(coef)      z
## Age                             -2.273e-02  9.775e-01  1.263e-02 -1.800
## BusinessTravelTravel_Frequently  1.497e+00  4.470e+00  3.758e-01  3.985
## BusinessTravelTravel_Rarely      8.676e-01  2.381e+00  3.561e-01  2.437
## DailyRate                       -5.566e-05  9.999e-01  1.924e-04 -0.289
## DistanceFromHome                 2.059e-02  1.021e+00  9.415e-03  2.187
## Education2                       4.378e-01  1.549e+00  2.907e-01  1.506
## Education3                       4.654e-01  1.593e+00  2.549e-01  1.826
## Education4                       2.674e-01  1.307e+00  2.779e-01  0.962
## Education5                       1.771e-01  1.194e+00  5.985e-01  0.296
## EducationFieldLife Sciences      1.625e-02  1.016e+00  5.162e-01  0.031
## EducationFieldMarketing          9.482e-01  2.581e+00  5.435e-01  1.745
## EducationFieldMedical           -2.042e-01  8.153e-01  5.178e-01 -0.394
## EducationFieldOther              2.537e-01  1.289e+00  6.324e-01  0.401
## EducationFieldTechnical Degree   3.796e-01  1.462e+00  5.437e-01  0.698
## EnvironmentSatisfaction         -3.168e-01  7.285e-01  6.911e-02 -4.584
## GenderMale                       3.683e-01  1.445e+00  1.619e-01  2.274
## HourlyRate                       4.034e-03  1.004e+00  4.122e-03  0.979
## JobInvolvement                  -3.068e-01  7.358e-01  1.044e-01 -2.939
## JobLevel2                       -6.778e-01  5.078e-01  3.010e-01 -2.251
## JobLevel3                        4.284e-01  1.535e+00  6.041e-01  0.709
## JobLevel4                       -2.032e-01  8.161e-01  1.162e+00 -0.175
## JobLevel5                        1.187e+00  3.277e+00  1.450e+00  0.818
## JobSatisfaction                 -3.758e-01  6.867e-01  7.189e-02 -5.228
## MaritalStatusMarried            -1.373e-01  8.717e-01  2.512e-01 -0.547
## MaritalStatusSingle              6.423e-02  1.066e+00  3.414e-01  0.188
## MonthlyIncome                   -9.862e-05  9.999e-01  7.810e-05 -1.263
## MonthlyRate                      2.365e-05  1.000e+00  1.109e-05  2.133
## NumCompaniesWorked               2.280e-01  1.256e+00  3.216e-02  7.090
## OverTimeYes                      1.186e+00  3.274e+00  1.563e-01  7.586
## PercentSalaryHike                1.469e-02  1.015e+00  3.476e-02  0.422
## PerformanceRating                1.255e-02  1.013e+00  3.449e-01  0.036
## RelationshipSatisfaction        -1.803e-01  8.350e-01  7.425e-02 -2.428
## StockOptionLevel1               -9.984e-01  3.685e-01  2.776e-01 -3.597
## StockOptionLevel2               -1.180e+00  3.073e-01  4.115e-01 -2.867
## StockOptionLevel3               -6.080e-01  5.444e-01  3.888e-01 -1.564
## TotalWorkingYears               -1.516e-01  8.594e-01  3.422e-02 -4.428
## TrainingTimesLastYear           -1.775e-01  8.374e-01  6.717e-02 -2.642
## WorkLifeBalance                 -1.828e-01  8.330e-01  1.049e-01 -1.742
## YearsInCurrentRole              -3.718e-01  6.895e-01  4.246e-02 -8.757
## YearsSinceLastPromotion          1.509e-02  1.015e+00  3.828e-02  0.394
## YearsWithCurrManager            -3.511e-01  7.039e-01  4.023e-02 -8.727
##                                 Pr(>|z|)    
## Age                             0.071866 .  
## BusinessTravelTravel_Frequently 6.75e-05 ***
## BusinessTravelTravel_Rarely     0.014820 *  
## DailyRate                       0.772307    
## DistanceFromHome                0.028731 *  
## Education2                      0.132090    
## Education3                      0.067878 .  
## Education4                      0.336038    
## Education5                      0.767316    
## EducationFieldLife Sciences     0.974879    
## EducationFieldMarketing         0.081061 .  
## EducationFieldMedical           0.693239    
## EducationFieldOther             0.688243    
## EducationFieldTechnical Degree  0.485109    
## EnvironmentSatisfaction         4.57e-06 ***
## GenderMale                      0.022943 *  
## HourlyRate                      0.327775    
## JobInvolvement                  0.003291 ** 
## JobLevel2                       0.024356 *  
## JobLevel3                       0.478244    
## JobLevel4                       0.861202    
## JobLevel5                       0.413176    
## JobSatisfaction                 1.72e-07 ***
## MaritalStatusMarried            0.584693    
## MaritalStatusSingle             0.850767    
## MonthlyIncome                   0.206706    
## MonthlyRate                     0.032943 *  
## NumCompaniesWorked              1.34e-12 ***
## OverTimeYes                     3.30e-14 ***
## PercentSalaryHike               0.672709    
## PerformanceRating               0.970964    
## RelationshipSatisfaction        0.015193 *  
## StockOptionLevel1               0.000322 ***
## StockOptionLevel2               0.004141 ** 
## StockOptionLevel3               0.117829    
## TotalWorkingYears               9.50e-06 ***
## TrainingTimesLastYear           0.008235 ** 
## WorkLifeBalance                 0.081557 .  
## YearsInCurrentRole               < 2e-16 ***
## YearsSinceLastPromotion         0.693358    
## YearsWithCurrManager             < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                                 exp(coef) exp(-coef) lower .95 upper .95
## Age                                0.9775     1.0230   0.95363    1.0020
## BusinessTravelTravel_Frequently    4.4702     0.2237   2.14022    9.3366
## BusinessTravelTravel_Rarely        2.3813     0.4199   1.18503    4.7851
## DailyRate                          0.9999     1.0001   0.99957    1.0003
## DistanceFromHome                   1.0208     0.9796   1.00214    1.0398
## Education2                         1.5492     0.6455   0.87635    2.7388
## Education3                         1.5927     0.6279   0.96638    2.6249
## Education4                         1.3065     0.7654   0.75779    2.2526
## Education5                         1.1937     0.8377   0.36938    3.8578
## EducationFieldLife Sciences        1.0164     0.9839   0.36956    2.7953
## EducationFieldMarketing            2.5811     0.3874   0.88952    7.4893
## EducationFieldMedical              0.8153     1.2266   0.29551    2.2492
## EducationFieldOther                1.2888     0.7759   0.37316    4.4514
## EducationFieldTechnical Degree     1.4617     0.6842   0.50354    4.2428
## EnvironmentSatisfaction            0.7285     1.3727   0.63621    0.8342
## GenderMale                         1.4453     0.6919   1.05223    1.9851
## HourlyRate                         1.0040     0.9960   0.99596    1.0122
## JobInvolvement                     0.7358     1.3591   0.59966    0.9028
## JobLevel2                          0.5078     1.9695   0.28146    0.9160
## JobLevel3                          1.5348     0.6516   0.46973    5.0146
## JobLevel4                          0.8161     1.2253   0.08365    7.9619
## JobLevel5                          3.2768     0.3052   0.19094   56.2343
## JobSatisfaction                    0.6867     1.4562   0.59647    0.7906
## MaritalStatusMarried               0.8717     1.1472   0.53274    1.4263
## MaritalStatusSingle                1.0663     0.9378   0.54611    2.0822
## MonthlyIncome                      0.9999     1.0001   0.99975    1.0001
## MonthlyRate                        1.0000     1.0000   1.00000    1.0000
## NumCompaniesWorked                 1.2561     0.7961   1.17939    1.3379
## OverTimeYes                        3.2741     0.3054   2.40997    4.4481
## PercentSalaryHike                  1.0148     0.9854   0.94795    1.0863
## PerformanceRating                  1.0126     0.9875   0.51508    1.9908
## RelationshipSatisfaction           0.8350     1.1975   0.72195    0.9659
## StockOptionLevel1                  0.3685     2.7139   0.21385    0.6349
## StockOptionLevel2                  0.3073     3.2536   0.13721    0.6885
## StockOptionLevel3                  0.5444     1.8368   0.25410    1.1664
## TotalWorkingYears                  0.8594     1.1636   0.80361    0.9190
## TrainingTimesLastYear              0.8374     1.1942   0.73409    0.9552
## WorkLifeBalance                    0.8330     1.2005   0.67814    1.0232
## YearsInCurrentRole                 0.6895     1.4504   0.63443    0.7493
## YearsSinceLastPromotion            1.0152     0.9850   0.94183    1.0943
## YearsWithCurrManager               0.7039     1.4206   0.65053    0.7617
## 
## Concordance= 0.954  (se = 0.006 )
## Likelihood ratio test= 705.3  on 41 df,   p=<2e-16
## Wald test            = 438.6  on 41 df,   p=<2e-16
## Score (logrank) test = 575.7  on 41 df,   p=<2e-16

Here our coeffiecents are not too large to give us cause for concern. Our next step is to evaluate those model assumption tests of our data.

The first we’ll look at is the proportional hazards (PH) assumption. Since the Cox model evaluates events over time, if certain variables behave differently at different time points (i.e., age has one kind of effect on attrition when age > 35 but different effects when age < 35) that would violate a key model assumption.The proportional hazards assumption can be checked using statistical tests and graphical diagnostics based on the scaled Schoenfeld residuals. To conduct this test, we’ll use the cox.zph() formula.

For each covariate, the function cox.zph() correlates the corresponding set of scaled Schoenfeld residuals with time, to test for independence between residuals and time. Additionally, it performs a global test for the model as a whole. Ideally, we want all of the corresponding values to be non-significant.

cox.zph(cox.model.simple)

##                             chisq df       p
## Age                        0.0273  1  0.8689
## BusinessTravel            12.9727  2  0.0015
## DailyRate                  3.8145  1  0.0508
## DistanceFromHome           0.0263  1  0.8712
## Education                  3.2334  4  0.5196
## EducationField            28.9422  5 2.4e-05
## EnvironmentSatisfaction    0.5387  1  0.4630
## Gender                     2.1770  1  0.1401
## HourlyRate                 0.7451  1  0.3880
## JobInvolvement             0.9012  1  0.3424
## JobLevel                  11.8422  4  0.0186
## JobSatisfaction            0.4432  1  0.5056
## MaritalStatus              2.5961  2  0.2731
## MonthlyIncome              0.6448  1  0.4220
## MonthlyRate                0.1186  1  0.7306
## NumCompaniesWorked         3.5422  1  0.0598
## OverTime                   0.6879  1  0.4069
## PercentSalaryHike          0.4551  1  0.4999
## PerformanceRating          0.0354  1  0.8507
## RelationshipSatisfaction   0.6910  1  0.4058
## StockOptionLevel           5.3265  3  0.1494
## TotalWorkingYears          1.3649  1  0.2427
## TrainingTimesLastYear      0.9031  1  0.3420
## WorkLifeBalance            0.8317  1  0.3618
## YearsInCurrentRole        82.9276  1 < 2e-16
## YearsSinceLastPromotion    0.6691  1  0.4134
## YearsWithCurrManager      21.3374  1 3.9e-06
## GLOBAL                   181.5072 41 < 2e-16

The global score for our model should be non-significant (which is very much not the case here) to indicate we did not violate our test assumptions. We can go ahead and see which individual variables are also driving the GLOBAL stat.

Based on the model assumption test, we have a new model that excludes the variables that didn’t pass the test.

cox.model.specified <-  coxph(formula = surv.obj ~ Age +  DailyRate + DistanceFromHome + Education + EnvironmentSatisfaction + HourlyRate + JobInvolvement + JobLevel + JobSatisfaction + MaritalStatus + MonthlyIncome + MonthlyRate + OverTime +
                                PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance +  YearsSinceLastPromotion, data = train)

cox.zph(cox.model.specified)

##                             chisq df       p
## Age                       1.04573  1 0.30649
## DailyRate                 2.21677  1 0.13652
## DistanceFromHome          0.00386  1 0.95049
## Education                 3.91352  4 0.41784
## EnvironmentSatisfaction   0.91316  1 0.33928
## HourlyRate                0.08422  1 0.77166
## JobInvolvement            0.60139  1 0.43805
## JobLevel                 13.33479  4 0.00975
## JobSatisfaction           1.11252  1 0.29153
## MaritalStatus             4.00768  2 0.13482
## MonthlyIncome             1.21766  1 0.26982
## MonthlyRate               0.37310  1 0.54132
## OverTime                  1.46734  1 0.22577
## PercentSalaryHike         0.01284  1 0.90978
## PerformanceRating         0.06648  1 0.79654
## RelationshipSatisfaction  0.60586  1 0.43635
## StockOptionLevel          5.26629  3 0.15330
## TotalWorkingYears         0.12557  1 0.72307
## TrainingTimesLastYear     0.89243  1 0.34482
## WorkLifeBalance           0.38761  1 0.53356
## YearsSinceLastPromotion  11.17136  1 0.00083
## GLOBAL                   49.87591 30 0.01278

So our model is still not passing the PH test, but we’re doing better. Here we only need to remove JobLevel and YearsSinceLastPromotion.

cox.model.specified <-  coxph(formula = surv.obj ~  DailyRate + DistanceFromHome + Education + EnvironmentSatisfaction + HourlyRate + JobInvolvement +  JobSatisfaction + MonthlyIncome + MonthlyRate + OverTime +
                                PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + StockOptionLevel + TotalWorkingYears +
                                TrainingTimesLastYear + WorkLifeBalance + Age, data = train)
cox.zph(cox.model.specified)

##                            chisq df    p
## DailyRate                 1.1148  1 0.29
## DistanceFromHome          0.1641  1 0.69
## Education                 2.9774  4 0.56
## EnvironmentSatisfaction   0.6634  1 0.42
## HourlyRate                0.1239  1 0.72
## JobInvolvement            0.8622  1 0.35
## JobSatisfaction           0.6491  1 0.42
## MonthlyIncome             1.5187  1 0.22
## MonthlyRate               0.9599  1 0.33
## OverTime                  1.6071  1 0.20
## PercentSalaryHike         0.2847  1 0.59
## PerformanceRating         0.0422  1 0.84
## RelationshipSatisfaction  0.3228  1 0.57
## StockOptionLevel          4.1197  3 0.25
## TotalWorkingYears         0.4274  1 0.51
## TrainingTimesLastYear     0.9655  1 0.33
## WorkLifeBalance           0.3766  1 0.54
## Age                       1.6740  1 0.20
## GLOBAL                   18.8075 23 0.71

Success! Our model is now passing the PH test. It should be noted that just because a variable doesn’t pass this PH test it has to be excluded - we can multiply the variable by an interaction with time, we could change the “form” of the variable (i.e., take a continuous variable and convert to bins/factors) and eliminate extreme values.

Now let’s examine the summary output in more detail for our final model.

summary(cox.model.specified)

## Call:
## coxph(formula = surv.obj ~ DailyRate + DistanceFromHome + Education + 
##     EnvironmentSatisfaction + HourlyRate + JobInvolvement + JobSatisfaction + 
##     MonthlyIncome + MonthlyRate + OverTime + PercentSalaryHike + 
##     PerformanceRating + RelationshipSatisfaction + StockOptionLevel + 
##     TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance + 
##     Age, data = train)
## 
##   n= 1176, number of events= 193 
## 
##                                coef  exp(coef)   se(coef)      z Pr(>|z|)    
## DailyRate                -1.603e-04  9.998e-01  1.864e-04 -0.860 0.389684    
## DistanceFromHome          2.680e-02  1.027e+00  8.716e-03  3.074 0.002109 ** 
## Education2                5.443e-02  1.056e+00  2.793e-01  0.195 0.845513    
## Education3                3.503e-01  1.420e+00  2.377e-01  1.474 0.140603    
## Education4                1.788e-01  1.196e+00  2.652e-01  0.674 0.500241    
## Education5               -3.881e-01  6.784e-01  5.727e-01 -0.678 0.498008    
## EnvironmentSatisfaction  -2.365e-01  7.894e-01  6.445e-02 -3.669 0.000244 ***
## HourlyRate               -2.729e-03  9.973e-01  3.811e-03 -0.716 0.473975    
## JobInvolvement           -2.921e-01  7.467e-01  9.968e-02 -2.930 0.003389 ** 
## JobSatisfaction          -3.262e-01  7.217e-01  6.708e-02 -4.863 1.16e-06 ***
## MonthlyIncome            -6.692e-05  9.999e-01  3.503e-05 -1.911 0.056067 .  
## MonthlyRate               1.068e-05  1.000e+00  1.019e-05  1.047 0.294901    
## OverTimeYes               1.278e+00  3.590e+00  1.497e-01  8.537  < 2e-16 ***
## PercentSalaryHike         3.502e-03  1.004e+00  3.195e-02  0.110 0.912731    
## PerformanceRating         8.860e-02  1.093e+00  3.244e-01  0.273 0.784769    
## RelationshipSatisfaction -9.085e-02  9.132e-01  6.727e-02 -1.351 0.176826    
## StockOptionLevel1        -9.119e-01  4.018e-01  1.749e-01 -5.215 1.84e-07 ***
## StockOptionLevel2        -1.106e+00  3.309e-01  3.328e-01 -3.323 0.000891 ***
## StockOptionLevel3        -4.341e-01  6.479e-01  3.200e-01 -1.357 0.174927    
## TotalWorkingYears        -2.854e-01  7.517e-01  3.115e-02 -9.160  < 2e-16 ***
## TrainingTimesLastYear    -1.100e-01  8.959e-01  6.237e-02 -1.763 0.077905 .  
## WorkLifeBalance          -1.434e-01  8.664e-01  1.016e-01 -1.411 0.158198    
## Age                       8.193e-03  1.008e+00  1.144e-02  0.716 0.473961    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                          exp(coef) exp(-coef) lower .95 upper .95
## DailyRate                   0.9998     1.0002    0.9995    1.0002
## DistanceFromHome            1.0272     0.9736    1.0098    1.0449
## Education2                  1.0559     0.9470    0.6108    1.8256
## Education3                  1.4195     0.7045    0.8908    2.2621
## Education4                  1.1957     0.8363    0.7111    2.0108
## Education5                  0.6784     1.4742    0.2208    2.0843
## EnvironmentSatisfaction     0.7894     1.2668    0.6957    0.8957
## HourlyRate                  0.9973     1.0027    0.9899    1.0048
## JobInvolvement              0.7467     1.3392    0.6142    0.9078
## JobSatisfaction             0.7217     1.3857    0.6327    0.8231
## MonthlyIncome               0.9999     1.0001    0.9999    1.0000
## MonthlyRate                 1.0000     1.0000    1.0000    1.0000
## OverTimeYes                 3.5904     0.2785    2.6773    4.8150
## PercentSalaryHike           1.0035     0.9965    0.9426    1.0684
## PerformanceRating           1.0926     0.9152    0.5785    2.0636
## RelationshipSatisfaction    0.9132     1.0951    0.8004    1.0418
## StockOptionLevel1           0.4018     2.4890    0.2852    0.5660
## StockOptionLevel2           0.3309     3.0220    0.1723    0.6353
## StockOptionLevel3           0.6479     1.5436    0.3460    1.2130
## TotalWorkingYears           0.7517     1.3303    0.7072    0.7991
## TrainingTimesLastYear       0.8959     1.1162    0.7928    1.0124
## WorkLifeBalance             0.8664     1.1542    0.7100    1.0573
## Age                         1.0082     0.9918    0.9859    1.0311
## 
## Concordance= 0.877  (se = 0.015 )
## Likelihood ratio test= 433  on 23 df,   p=<2e-16
## Wald test            = 308.2  on 23 df,   p=<2e-16
## Score (logrank) test = 344.4  on 23 df,   p=<2e-16

We can now look at a few key takeaways from the model. At the bottom of the output we have 3 tests for the global significance: Liklihood ratio tests, Wald test, and Logrank score. In our model, all three global sig tests would lead us to conclude that the model we have derived is significant (p < .00). The next step is then to determine which variables from our model are significantly predicting attrition. The column labled Pr(>|z|) indicates p-values at the individual variable level. To the right of the column, we can easily see which variables are significant.

The next step is determining how the variables are influencing towards our dependent variable (in our case, attrition). Our DV is coded as 0 = No attrit, 1 = Attrit. We can then look at the column labled coef and determine that if the number is positive, it is influencing towards Attrition while a negative value is influencing towards retention. We can then look at the column directly to the right, exp(coef), to calculate the hazards ratio (HR). The hazard ratios of covariates are interpretable as multiplicative effects on the hazard,holding the other covariates constant. The simple way to calculate/interpret the HR is to subtract the number in the exp(coef) colunmn from 1 and convert to a percentage. For example, JobSatisfaction exp(coef)= 0.72. We can then say that each unit increase of JobSatisfaction (1-5, higher scores indicate higher job satisfaction) has a 28% (1 - 0.72 = .28) of increasing retention. Similarly, we can look at OverTimeYes, exp(coef) = 3.644, subtract 1 from 3.644 (for ease of interpretation) and we get 2.64. After conversion to a percentage, we can estimate that, holding all other variables equal, being an overtime employee is associated with a 264% increased risk of attrition!

Before we get too caught up in our model interpretation, we should continue testing key assumptions. We already have the PH assumption out of the way so let’s explore others.

Additional Assumption Testing

We can test the impact of influential observations with the code:

ggcoxdiagnostics(cox.model.specified, type = "dfbeta",
                 linear.predictions = FALSE, ggtheme = theme_bw())

## `geom_smooth()` using formula 'y ~ x'

and we can inspect residuals with:

ggcoxdiagnostics(cox.model.specified, type = "deviance",
                 linear.predictions = FALSE, ggtheme = theme_bw())

## `geom_smooth()` using formula 'y ~ x'

The above residuals should give us pause and stimulate discussion about why we might be observing this particular pattern.

We can also test for linearity of our continuous variables. Plotting the Martingale residuals against continuous covariates is a common approach used to detect nonlinearity or, in other words, to assess the functional form of a covariate (does our variable violate a key assumption of the test and therefore warrant some type of transformation). The function ggcoxfunctional() displays graphs of continuous covariates against martingale residuals of null cox proportional hazards model. Essentially, we want the fitted lowess lines to demonstrate a linear relation with the plotted data.

ggcoxfunctional(Surv(YearsAtCompany, Attrition) ~ TotalWorkingYears +  DistanceFromHome + EnvironmentSatisfaction +  JobInvolvement + JobSatisfaction + MonthlyIncome, data = train)

## Warning: arguments formula is deprecated; will be removed in the next version;
## please use fit instead.

In this case we might see some cause for concern in EnvironmentSatisfaction, DistanceFromHome, and MonthlyIncome. In some senses, this can make sense. Self-report satisfaction surveys tend to skew positive. Why don’t we try the same assumption test on a log transformed version of MonthlyIncome, to try and account for skew.

ggcoxfunctional(Surv(YearsAtCompany, Attrition) ~ log(MonthlyIncome), data = train)

## Warning: arguments formula is deprecated; will be removed in the next version;
## please use fit instead.

Using our Data to Develop Predictive Models and Acctionable Insight

The following code essentially takes the Cox model we built earlier from the 80% split of data (train) and sees how well it applies to the 20% data we reserved for testing (test). The corresponding number we pull out is the area-under-the-curve (AUC). This number exists on a 0 to 1 range and essentially tells us how well our model predicts the test data. AUC scores above .5 indicate that the model is predicting 0’s when it should predict 0’s and 1’s when it should predict 1’s. Anything under .5 indicates that our model is predicting 0’s when it should predict 1’s and vice versa.

cox.pred.full <- predict(cox.model.specified, newdata = test, type = "lp")
roc.obj <- survivalROC::survivalROC(Stime = test$YearsAtCompany,
                                    status = test$Attrition,
                                    marker = cox.pred.full,
                                    predict.time = 1,
                                    lambda = 0.003)
roc.obj$AUC

## [1] 0.7994259

While an AUC score of .8 isn’t bad, let’s look at what happens when we use our earlier, full model.

cox.pred.full <- predict(cox.model.full, newdata = test, type = "lp")
roc.obj <- survivalROC::survivalROC(Stime = test$YearsAtCompany,
                                    status = test$Attrition,
                                    marker = cox.pred.full,
                                    predict.time = 1,
                                    lambda = 0.003)
roc.obj$AUC

## [1] 0.9701103

So essentially our model prediction cabability shot up to nearly perfect.

We can continue to apply these predictive features to our data to now guide decision making.

# Inputting risk estimates into our data
train$lp <- predict(cox.model.specified, type = "lp")
train$Risk <- predict(cox.model.specified, type = "risk")
train$Test <- predict(cox.model.full, type = "risk")
train$Exp <- predict(cox.model.specified, type = "expected")
test$Risk <- predict(cox.model.specified, newdata = test, type = "risk")

glimpse(train %>% 
          select(Attrition, lp, Risk, Test, Exp) %>%  arrange(desc(Risk)))

## Rows: 1,176
## Columns: 5
## $ Attrition <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, ~
## $ lp        <dbl> 5.572197, 5.528736, 5.366441, 5.270501, 5.215228, 5.071213, ~
## $ Risk      <dbl> 263.01136, 251.82544, 214.09960, 194.51342, 184.05375, 159.3~
## $ Test      <dbl> 5047.3302, 460.7552, 2568.8619, 1266.7733, 782.6617, 2023.72~
## $ Exp       <dbl> 1.38017879, 1.32147952, 1.12350935, 1.02072892, 2.15397569, ~

glimpse(train %>% 
          select(Attrition, lp, Risk, Test, Exp) %>%  arrange(-desc(Risk)))

## Rows: 1,176
## Columns: 5
## $ Attrition <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ~
## $ lp        <dbl> -9.619404, -8.928902, -8.797936, -8.508254, -8.408845, -8.39~
## $ Risk      <dbl> 0.0000664272, 0.0001325034, 0.0001510445, 0.0002017958, 0.00~
## $ Test      <dbl> 4.315366e-05, 1.296558e-04, 3.346174e-04, 2.825509e-04, 6.87~
## $ Exp       <dbl> 1.012227e+00, 1.921798e-05, 2.780320e-02, 3.714513e-02, 4.10~

glimpse(train %>% 
          select(Attrition, lp, Risk, Test, Exp) %>%  arrange(desc(Test)))

## Rows: 1,176
## Columns: 5
## $ Attrition <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ lp        <dbl> 4.832234, 5.572197, 4.759101, 5.366441, 4.807865, 3.397195, ~
## $ Risk      <dbl> 125.491056, 263.011361, 116.640991, 214.099597, 122.469847, ~
## $ Test      <dbl> 8387.3881, 5047.3302, 2571.6417, 2568.8619, 2567.4601, 2218.~
## $ Exp       <dbl> 0.107292957, 1.380178791, 0.612085428, 1.123509348, 0.104709~

glimpse(train %>% 
          select(Attrition, lp, Risk, Test, Exp) %>%  arrange(-desc(Test)))

## Rows: 1,176
## Columns: 5
## $ Attrition <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, ~
## $ lp        <dbl> -5.848341, -4.736709, -7.019900, -6.110899, -7.952284, -7.22~
## $ Risk      <dbl> 0.0028846808, 0.0087674549, 0.0008939145, 0.0022185562, 0.00~
## $ Test      <dbl> 1.745942e-11, 1.266040e-10, 2.107029e-09, 2.585660e-09, 3.52~
## $ Exp       <dbl> 5.309915e-01, 3.807959e-02, 9.224744e-02, 3.217740e-04, 2.03~

Comparing Models with ANOVA

Ideally we should be approaching HR issues with some form of testable hypothesis. One way to accomplish that is testing different nested models.

For example, let’s say we had a hypothesis that Stock Option Level significantly influenced attrition. The following code takes a few variables that we possibly already know are meaningful predictors of attrition (fit), and compares that model to one that includes the addition of Stock Option (fit2).

The ANOVA evaluates whether the additional variable explains enough additional variance in the model to be significant.

# Comparing models with anova()
fit<- coxph(formula = surv.obj ~ BusinessTravel + Gender + JobLevel + JobRole, data = train)
fit2<- coxph(formula = surv.obj ~ BusinessTravel + Gender + JobLevel + JobRole + StockOptionLevel, data = train)
anova(fit, fit2)

## Analysis of Deviance Table
##  Cox model: response is  surv.obj
##  Model 1: ~ BusinessTravel + Gender + JobLevel + JobRole
##  Model 2: ~ BusinessTravel + Gender + JobLevel + JobRole + StockOptionLevel
##    loglik  Chisq Df P(>|Chi|)    
## 1 -1136.6                        
## 2 -1115.4 42.428  3 3.255e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Significant differences here indicate that our second model contributes to a greater proportion of variance explained compared to the first model.

Attrition Presentation

Jon D.

8/7/2020

Kaplan-Meirer Survival Curve

Cox Proportional Hazards Model

Additional Assumption Testing

Using our Data to Develop Predictive Models and Acctionable Insight

Comparing Models with ANOVA