getwd()
## [1] "C:/Users/joncd/Documents"
The following analysis represents an approach to analyzing employee attrition via survival analysis. We will be using a publicly available simulated dataset from IBM which can be downloaded from Kaggle here IBM Data.
Additional helpful links can be found at:
*In Depth Discussion of Survival Analysis in R
We will begin by importing our library, importing our data, and conducting some simple data transformations to make the anlaysis easier on ourselves.
library(readr)
library(tidyverse)
library(survival)
library(survminer)
WA_Fn_UseC_HR_Employee_Attrition <- read_csv("C:/Users/joncd/Downloads/WA_Fn-UseC_-HR-Employee-Attrition.csv")
data <- as_tibble(WA_Fn_UseC_HR_Employee_Attrition)
set.seed(150) # For Reproducibility
# Simple data cleaning
workset <- data %>%
mutate(BusinessTravel = as.factor(BusinessTravel),
Department = as.factor(Department),
Education = as.factor(Education),
EducationField = as.factor(EducationField),
Gender = as.factor(Gender),
JobLevel = as.factor(JobLevel),
JobRole = as.factor(JobRole),
MaritalStatus = as.factor(MaritalStatus),
OverTime = as.factor(OverTime),
Over18 = as.factor(Over18),
StockOptionLevel = as.factor(StockOptionLevel),
Attrition = ifelse(Attrition == "No",0,1))
Let’s take a quick look at our dataset
glimpse(workset)
## Rows: 1,470
## Columns: 35
## $ Age <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2~
## $ Attrition <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ~
## $ BusinessTravel <fct> Travel_Rarely, Travel_Frequently, Travel_Rare~
## $ DailyRate <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,~
## $ Department <fct> Sales, Research & Development, Research & Dev~
## $ DistanceFromHome <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, ~
## $ Education <fct> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, ~
## $ EducationField <fct> Life Sciences, Life Sciences, Other, Life Sci~
## $ EmployeeCount <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ EmployeeNumber <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,~
## $ EnvironmentSatisfaction <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, ~
## $ Gender <fct> Female, Male, Male, Female, Male, Male, Femal~
## $ HourlyRate <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4~
## $ JobInvolvement <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, ~
## $ JobLevel <fct> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, ~
## $ JobRole <fct> Sales Executive, Research Scientist, Laborato~
## $ JobSatisfaction <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, ~
## $ MaritalStatus <fct> Single, Married, Single, Married, Married, Si~
## $ MonthlyIncome <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269~
## $ MonthlyRate <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964~
## $ NumCompaniesWorked <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, ~
## $ Over18 <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ~
## $ OverTime <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~
## $ PercentSalaryHike <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1~
## $ PerformanceRating <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, ~
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, ~
## $ StandardHours <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8~
## $ StockOptionLevel <fct> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, ~
## $ TotalWorkingYears <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3~
## $ TrainingTimesLastYear <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, ~
## $ WorkLifeBalance <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, ~
## $ YearsAtCompany <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,~
## $ YearsInCurrentRole <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, ~
## $ YearsSinceLastPromotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, ~
## $ YearsWithCurrManager <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, ~
Next, we’ll perform a standard 80/20 split of the data for model validation later on.
dt = sort(sample(nrow(workset), nrow(workset)*.8))
train <- workset[dt,]
test <- workset[-dt,]
surv.obj <- Surv(train$YearsAtCompany, train$Attrition)
surv.fit <- survfit(surv.obj ~ 1)
summary(surv.fit)
## Call: survfit(formula = surv.obj ~ 1)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 1176 14 0.988 0.00316 0.982 0.994
## 1 1137 51 0.944 0.00678 0.931 0.957
## 2 1001 20 0.925 0.00784 0.910 0.940
## 3 893 15 0.909 0.00868 0.893 0.927
## 4 789 17 0.890 0.00970 0.871 0.909
## 5 695 19 0.865 0.01093 0.844 0.887
## 6 539 7 0.854 0.01158 0.832 0.877
## 7 489 9 0.839 0.01250 0.814 0.863
## 8 414 7 0.824 0.01339 0.799 0.851
## 9 357 5 0.813 0.01416 0.785 0.841
## 10 295 14 0.774 0.01683 0.742 0.808
## 11 196 1 0.770 0.01720 0.737 0.805
## 13 161 2 0.761 0.01827 0.726 0.797
## 15 128 1 0.755 0.01907 0.718 0.793
## 17 102 1 0.747 0.02026 0.709 0.788
## 18 95 1 0.739 0.02152 0.698 0.783
## 19 84 1 0.731 0.02300 0.687 0.777
## 20 74 1 0.721 0.02472 0.674 0.771
## 21 55 1 0.708 0.02752 0.656 0.764
## 22 45 1 0.692 0.03108 0.634 0.756
## 24 32 1 0.670 0.03687 0.602 0.747
## 31 14 1 0.622 0.05746 0.519 0.746
## 32 12 1 0.571 0.07239 0.445 0.732
## 33 9 1 0.507 0.08783 0.361 0.712
## 40 1 1 0.000 NaN NA NA
ggsurvplot(surv.fit,data = train, color = "#2E9FDF",
ggtheme = theme_minimal())
From this we see an overall survival slope for the entire population of company employees. Not super helpful for making acctionable decisions on its own but does give us some baseline idea of the employee lifecycle. For instance, half of all hires are still with the company after 30 years.
We can also begin to segment data here to compare separate conditons between each other. For example we can look at the difference between levels of our JobRole variable.
surv_job_role <- survfit(Surv(YearsAtCompany, Attrition) ~ JobRole, data = train)
ggsurv<-ggsurvplot(surv_job_role,
pval = FALSE, # add p-values
conf.int = FALSE, # if we wanted confident intervals, maybe not with this many levels of the JobRole variable
risk.table = FALSE, # Add risk table
risk.table.col = "strata", # Change risk table color by groups
linetype = "strata", # Change line type by groups
ggtheme = theme_bw()) # Change ggplot2 theme
ggsurv$plot +theme_bw() +
theme (legend.position = "right")
By gender,
surv_job_role <- survfit(Surv(YearsAtCompany, Attrition) ~ Gender, data = train)
ggsurvplot(surv_job_role,
pval = TRUE, conf.int = TRUE,
risk.table = TRUE,
risk.table.col = "strata",
linetype = "strata",
ggtheme = theme_bw())
By Department,
ggsurvplot(survfit(Surv(YearsAtCompany, Attrition) ~ Department, data = train), data = train, conf.int = FALSE)
and Stock Options,
surv_job_role <- survfit(Surv(YearsAtCompany, Attrition) ~ StockOptionLevel, data = train)
ggsurv2<-ggsurvplot(surv_job_role,
pval = TRUE, conf.int = FALSE,
risk.table = FALSE, # Add risk table
risk.table.col = "strata", # Change risk table color by groups
linetype = "strata", # Change line type by groups
ggtheme = theme_bw()) # Change ggplot2 theme
ggsurv2$plot +theme_bw() +
theme (legend.position = "right")
While visuals are helpuful, statistical analysis can help decipher more complex features.
summary(train$JobRole)
## Healthcare Representative Human Resources Laboratory Technician
## 102 38 202
## Manager Manufacturing Director Research Director
## 81 119 63
## Research Scientist Sales Executive Sales Representative
## 240 268 63
survdiff(Surv(YearsAtCompany, Attrition) ~ JobRole, data = train) # Tells us if differences exist between groups (based on logrank test of survival curves)
## Call:
## survdiff(formula = Surv(YearsAtCompany, Attrition) ~ JobRole,
## data = train)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## JobRole=Healthcare Representative 102 7 20.19 8.615 10.36
## JobRole=Human Resources 38 10 5.19 4.460 4.72
## JobRole=Laboratory Technician 202 48 26.97 16.390 19.79
## JobRole=Manager 81 5 21.84 12.987 16.45
## JobRole=Manufacturing Director 119 8 20.93 7.991 9.22
## JobRole=Research Director 63 2 13.99 10.274 11.51
## JobRole=Research Scientist 240 44 31.55 4.909 6.15
## JobRole=Sales Executive 268 43 46.79 0.307 0.42
## JobRole=Sales Representative 63 26 5.54 75.517 81.08
##
## Chisq= 156 on 8 degrees of freedom, p= <2e-16
train$JobRole <- relevel(train$JobRole, ref = "Sales Executive") # Because this is our largest group, we set this as our "referent" group for comparison in the subsequent code
cox.model.jobrole <- coxph(formula = surv.obj ~ JobRole, data = train) # why we needed to relevel()
cox.zph(cox.model.jobrole) # Assumption Test
## chisq df p
## JobRole 14.2 8 0.076
## GLOBAL 14.2 8 0.076
summary(cox.model.jobrole) # Tells us which groups are different and by how much
## Call:
## coxph(formula = surv.obj ~ JobRole, data = train)
##
## n= 1176, number of events= 193
##
## coef exp(coef) se(coef) z Pr(>|z|)
## JobRoleHealthcare Representative -1.1839 0.3061 0.4387 -2.699 0.006955 **
## JobRoleHuman Resources 0.8382 2.3122 0.3522 2.380 0.017332 *
## JobRoleLaboratory Technician 0.7363 2.0881 0.2112 3.486 0.000490 ***
## JobRoleManager -1.7807 0.1685 0.4956 -3.593 0.000327 ***
## JobRoleManufacturing Director -0.9083 0.4032 0.3854 -2.357 0.018438 *
## JobRoleResearch Director -2.0654 0.1268 0.7284 -2.836 0.004574 **
## JobRoleResearch Scientist 0.5127 1.6697 0.2165 2.368 0.017903 *
## JobRoleSales Representative 1.8477 6.3453 0.2553 7.236 4.61e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## JobRoleHealthcare Representative 0.3061 3.2672 0.12955 0.7231
## JobRoleHuman Resources 2.3122 0.4325 1.15930 4.6117
## JobRoleLaboratory Technician 2.0881 0.4789 1.38034 3.1589
## JobRoleManager 0.1685 5.9341 0.06379 0.4452
## JobRoleManufacturing Director 0.4032 2.4800 0.18945 0.8582
## JobRoleResearch Director 0.1268 7.8884 0.03041 0.5285
## JobRoleResearch Scientist 1.6697 0.5989 1.09228 2.5524
## JobRoleSales Representative 6.3453 0.1576 3.84683 10.4664
##
## Concordance= 0.742 (se = 0.018 )
## Likelihood ratio test= 136.2 on 8 df, p=<2e-16
## Wald test = 116 on 8 df, p=<2e-16
## Score (logrank) test = 159.4 on 8 df, p=<2e-16
fit<- survfit(Surv(YearsAtCompany,Attrition)~JobRole, data = train)
summary(fit) # Provides a detailed look at survival curves by group
## Call: survfit(formula = Surv(YearsAtCompany, Attrition) ~ JobRole,
## data = train)
##
## JobRole=Sales Executive
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 268 3 0.989 0.00643 0.976 1.000
## 1 261 2 0.981 0.00832 0.965 0.998
## 2 250 6 0.958 0.01249 0.934 0.982
## 3 230 3 0.945 0.01426 0.918 0.974
## 4 212 5 0.923 0.01706 0.890 0.957
## 5 193 5 0.899 0.01968 0.861 0.938
## 6 148 3 0.881 0.02192 0.839 0.925
## 7 135 2 0.868 0.02345 0.823 0.915
## 8 113 4 0.837 0.02719 0.785 0.892
## 9 92 1 0.828 0.02838 0.774 0.885
## 10 76 4 0.784 0.03424 0.720 0.854
## 11 43 1 0.766 0.03799 0.695 0.844
## 13 33 1 0.743 0.04336 0.663 0.833
## 15 24 1 0.712 0.05143 0.618 0.820
## 19 13 1 0.657 0.07086 0.532 0.812
## 21 7 1 0.563 0.10604 0.389 0.815
##
## JobRole=Healthcare Representative
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1 100 3 0.970 0.0171 0.937 1.000
## 9 41 1 0.946 0.0287 0.892 1.000
## 10 35 1 0.919 0.0386 0.847 0.998
## 18 11 1 0.836 0.0871 0.681 1.000
## 40 1 1 0.000 NaN NA NA
##
## JobRole=Human Resources
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1 38 4 0.895 0.0498 0.802 0.998
## 2 34 2 0.842 0.0592 0.734 0.966
## 3 26 1 0.810 0.0651 0.692 0.948
## 5 19 1 0.767 0.0744 0.634 0.928
## 7 11 1 0.697 0.0948 0.534 0.910
## 20 1 1 0.000 NaN NA NA
##
## JobRole=Laboratory Technician
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 202 5 0.975 0.0109 0.954 0.997
## 1 193 18 0.884 0.0227 0.841 0.930
## 2 155 4 0.861 0.0248 0.814 0.911
## 3 135 3 0.842 0.0266 0.792 0.896
## 4 113 6 0.798 0.0308 0.739 0.860
## 5 94 4 0.764 0.0339 0.700 0.833
## 6 68 1 0.752 0.0352 0.687 0.825
## 7 62 3 0.716 0.0393 0.643 0.797
## 9 43 2 0.683 0.0439 0.602 0.774
## 10 31 1 0.661 0.0477 0.573 0.761
## 17 7 1 0.566 0.0965 0.406 0.791
##
## JobRole=Manager
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 5 66 1 0.985 0.0150 0.956 1
## 7 57 1 0.968 0.0226 0.924 1
## 10 48 1 0.947 0.0298 0.891 1
## 24 18 1 0.895 0.0584 0.787 1
## 32 7 1 0.767 0.1285 0.552 1
##
## JobRole=Manufacturing Director
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1 117 2 0.983 0.0120 0.960 1.000
## 3 102 1 0.973 0.0153 0.944 1.000
## 4 90 1 0.962 0.0185 0.927 0.999
## 10 35 3 0.880 0.0486 0.790 0.981
## 33 1 1 0.000 NaN NA NA
##
## JobRole=Research Director
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 22 9 1 0.889 0.105 0.706 1
## 31 3 1 0.593 0.252 0.258 1
##
## JobRole=Research Scientist
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 240 2 0.992 0.00587 0.980 1.000
## 1 231 11 0.944 0.01498 0.916 0.974
## 2 192 6 0.915 0.01874 0.879 0.952
## 3 164 2 0.904 0.02010 0.865 0.944
## 4 142 3 0.885 0.02250 0.842 0.930
## 5 117 6 0.839 0.02795 0.786 0.896
## 6 78 3 0.807 0.03250 0.746 0.873
## 7 61 2 0.781 0.03642 0.712 0.855
## 8 50 3 0.734 0.04312 0.654 0.823
## 9 39 1 0.715 0.04594 0.630 0.811
## 10 32 4 0.626 0.05799 0.522 0.750
## 13 12 1 0.573 0.07291 0.447 0.736
##
## JobRole=Sales Representative
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 0 63 4 0.937 0.0307 0.878 0.999
## 1 56 11 0.753 0.0555 0.651 0.870
## 2 40 2 0.715 0.0588 0.609 0.840
## 3 26 5 0.577 0.0728 0.451 0.739
## 4 16 2 0.505 0.0796 0.371 0.688
## 5 11 2 0.413 0.0877 0.273 0.627
Our next step would be to understand the impact of additional covariates on attrition liklihood. For that, we’ll use the Cox Proportional Hazards model to model a survival curve, but with the logic of a regression.
The following code represents all variables in our data thrown into a model at once.
cox.model.full <- coxph(formula = surv.obj ~ Age + BusinessTravel + DailyRate + Department + DistanceFromHome + Education + EducationField + EnvironmentSatisfaction + Gender + HourlyRate + JobInvolvement + JobLevel +
JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome + MonthlyRate + NumCompaniesWorked + OverTime +
PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + StockOptionLevel + TotalWorkingYears +
TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager, data = train)
## Warning in fitter(X, Y, istrat, offset, init, control, weights = weights, :
## Loglik converged before variable 5,6,26 ; coefficient may be infinite.
As we might expect, based on the error message we recieved, there may be something fishy going on with our data. Ideally, we should test some additonal assumptions before exploring the model, but we will test those shortly.
Keep in mind that in an in an exponential model (which this is), anything greater than e+01 is really large.
summary(cox.model.full)
## Call:
## coxph(formula = surv.obj ~ Age + BusinessTravel + DailyRate +
## Department + DistanceFromHome + Education + EducationField +
## EnvironmentSatisfaction + Gender + HourlyRate + JobInvolvement +
## JobLevel + JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome +
## MonthlyRate + NumCompaniesWorked + OverTime + PercentSalaryHike +
## PerformanceRating + RelationshipSatisfaction + StockOptionLevel +
## TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance +
## YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager,
## data = train)
##
## n= 1176, number of events= 193
##
## coef exp(coef) se(coef) z
## Age -2.431e-02 9.760e-01 1.269e-02 -1.915
## BusinessTravelTravel_Frequently 1.636e+00 5.134e+00 3.920e-01 4.173
## BusinessTravelTravel_Rarely 8.958e-01 2.449e+00 3.688e-01 2.429
## DailyRate -6.955e-05 9.999e-01 1.955e-04 -0.356
## DepartmentResearch & Development 1.681e+01 1.999e+07 1.557e+03 0.011
## DepartmentSales 1.629e+01 1.193e+07 1.557e+03 0.010
## DistanceFromHome 2.203e-02 1.022e+00 9.422e-03 2.338
## Education2 4.348e-01 1.545e+00 2.989e-01 1.454
## Education3 5.586e-01 1.748e+00 2.604e-01 2.145
## Education4 3.107e-01 1.364e+00 2.882e-01 1.078
## Education5 4.037e-01 1.497e+00 6.296e-01 0.641
## EducationFieldLife Sciences 4.336e-01 1.543e+00 7.210e-01 0.601
## EducationFieldMarketing 1.023e+00 2.782e+00 7.586e-01 1.349
## EducationFieldMedical 2.421e-01 1.274e+00 7.202e-01 0.336
## EducationFieldOther 4.905e-01 1.633e+00 8.194e-01 0.599
## EducationFieldTechnical Degree 7.794e-01 2.180e+00 7.363e-01 1.058
## EnvironmentSatisfaction -3.345e-01 7.157e-01 6.924e-02 -4.831
## GenderMale 4.216e-01 1.524e+00 1.657e-01 2.544
## HourlyRate 7.283e-03 1.007e+00 4.228e-03 1.722
## JobInvolvement -2.508e-01 7.782e-01 1.074e-01 -2.335
## JobLevel2 -1.147e+00 3.175e-01 4.568e-01 -2.511
## JobLevel3 -2.535e-01 7.761e-01 6.962e-01 -0.364
## JobLevel4 -9.918e-01 3.709e-01 1.218e+00 -0.815
## JobLevel5 1.252e+00 3.499e+00 1.667e+00 0.751
## JobRoleHealthcare Representative -1.471e+00 2.298e-01 1.175e+00 -1.252
## JobRoleHuman Resources 1.616e+01 1.043e+07 1.557e+03 0.010
## JobRoleLaboratory Technician -5.782e-01 5.609e-01 1.146e+00 -0.504
## JobRoleManager -1.596e+00 2.028e-01 1.306e+00 -1.222
## JobRoleManufacturing Director -7.715e-01 4.623e-01 1.133e+00 -0.681
## JobRoleResearch Director -2.774e+00 6.238e-02 1.599e+00 -1.735
## JobRoleResearch Scientist -1.574e+00 2.072e-01 1.158e+00 -1.360
## JobRoleSales Representative 5.740e-02 1.059e+00 4.985e-01 0.115
## JobSatisfaction -3.893e-01 6.776e-01 7.274e-02 -5.351
## MaritalStatusMarried -6.191e-02 9.400e-01 2.557e-01 -0.242
## MaritalStatusSingle 2.179e-03 1.002e+00 3.519e-01 0.006
## MonthlyIncome -3.774e-05 1.000e+00 8.498e-05 -0.444
## MonthlyRate 2.428e-05 1.000e+00 1.137e-05 2.135
## NumCompaniesWorked 2.437e-01 1.276e+00 3.345e-02 7.286
## OverTimeYes 1.332e+00 3.789e+00 1.633e-01 8.159
## PercentSalaryHike 1.392e-02 1.014e+00 3.534e-02 0.394
## PerformanceRating 1.086e-01 1.115e+00 3.463e-01 0.314
## RelationshipSatisfaction -1.886e-01 8.282e-01 7.592e-02 -2.484
## StockOptionLevel1 -1.043e+00 3.525e-01 2.824e-01 -3.693
## StockOptionLevel2 -1.235e+00 2.909e-01 4.236e-01 -2.915
## StockOptionLevel3 -6.708e-01 5.113e-01 3.932e-01 -1.706
## TotalWorkingYears -1.371e-01 8.719e-01 3.449e-02 -3.975
## TrainingTimesLastYear -2.243e-01 7.991e-01 6.998e-02 -3.205
## WorkLifeBalance -1.502e-01 8.606e-01 1.083e-01 -1.387
## YearsInCurrentRole -4.014e-01 6.694e-01 4.385e-02 -9.154
## YearsSinceLastPromotion 3.963e-02 1.040e+00 4.021e-02 0.986
## YearsWithCurrManager -3.696e-01 6.910e-01 4.142e-02 -8.924
## Pr(>|z|)
## Age 0.055505 .
## BusinessTravelTravel_Frequently 3.00e-05 ***
## BusinessTravelTravel_Rarely 0.015161 *
## DailyRate 0.721984
## DepartmentResearch & Development 0.991387
## DepartmentSales 0.991652
## DistanceFromHome 0.019378 *
## Education2 0.145845
## Education3 0.031938 *
## Education4 0.281007
## Education5 0.521433
## EducationFieldLife Sciences 0.547557
## EducationFieldMarketing 0.177339
## EducationFieldMedical 0.736770
## EducationFieldOther 0.549421
## EducationFieldTechnical Degree 0.289858
## EnvironmentSatisfaction 1.36e-06 ***
## GenderMale 0.010964 *
## HourlyRate 0.084982 .
## JobInvolvement 0.019565 *
## JobLevel2 0.012026 *
## JobLevel3 0.715749
## JobLevel4 0.415348
## JobLevel5 0.452623
## JobRoleHealthcare Representative 0.210726
## JobRoleHuman Resources 0.991721
## JobRoleLaboratory Technician 0.614028
## JobRoleManager 0.221818
## JobRoleManufacturing Director 0.495988
## JobRoleResearch Director 0.082775 .
## JobRoleResearch Scientist 0.173951
## JobRoleSales Representative 0.908331
## JobSatisfaction 8.73e-08 ***
## MaritalStatusMarried 0.808717
## MaritalStatusSingle 0.995060
## MonthlyIncome 0.656916
## MonthlyRate 0.032779 *
## NumCompaniesWorked 3.19e-13 ***
## OverTimeYes 3.37e-16 ***
## PercentSalaryHike 0.693721
## PerformanceRating 0.753784
## RelationshipSatisfaction 0.013000 *
## StockOptionLevel1 0.000222 ***
## StockOptionLevel2 0.003560 **
## StockOptionLevel3 0.087965 .
## TotalWorkingYears 7.03e-05 ***
## TrainingTimesLastYear 0.001351 **
## WorkLifeBalance 0.165450
## YearsInCurrentRole < 2e-16 ***
## YearsSinceLastPromotion 0.324306
## YearsWithCurrManager < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## Age 9.760e-01 1.025e+00 0.952003 1.0006
## BusinessTravelTravel_Frequently 5.134e+00 1.948e-01 2.381197 11.0689
## BusinessTravelTravel_Rarely 2.449e+00 4.083e-01 1.188659 5.0464
## DailyRate 9.999e-01 1.000e+00 0.999547 1.0003
## DepartmentResearch & Development 1.999e+07 5.002e-08 0.000000 Inf
## DepartmentSales 1.193e+07 8.384e-08 0.000000 Inf
## DistanceFromHome 1.022e+00 9.782e-01 1.003570 1.0413
## Education2 1.545e+00 6.474e-01 0.859726 2.7750
## Education3 1.748e+00 5.720e-01 1.049410 2.9122
## Education4 1.364e+00 7.330e-01 0.775585 2.4000
## Education5 1.497e+00 6.679e-01 0.435904 5.1431
## EducationFieldLife Sciences 1.543e+00 6.481e-01 0.375479 6.3398
## EducationFieldMarketing 2.782e+00 3.594e-01 0.629098 12.3061
## EducationFieldMedical 1.274e+00 7.850e-01 0.310538 5.2258
## EducationFieldOther 1.633e+00 6.123e-01 0.327740 8.1386
## EducationFieldTechnical Degree 2.180e+00 4.587e-01 0.514877 9.2308
## EnvironmentSatisfaction 7.157e-01 1.397e+00 0.624855 0.8197
## GenderMale 1.524e+00 6.560e-01 1.101600 2.1094
## HourlyRate 1.007e+00 9.927e-01 0.998996 1.0157
## JobInvolvement 7.782e-01 1.285e+00 0.630484 0.9606
## JobLevel2 3.175e-01 3.150e+00 0.129680 0.7773
## JobLevel3 7.761e-01 1.289e+00 0.198307 3.0372
## JobLevel4 3.709e-01 2.696e+00 0.034105 4.0339
## JobLevel5 3.499e+00 2.858e-01 0.133208 91.8880
## JobRoleHealthcare Representative 2.298e-01 4.352e+00 0.022967 2.2988
## JobRoleHuman Resources 1.043e+07 9.587e-08 0.000000 Inf
## JobRoleLaboratory Technician 5.609e-01 1.783e+00 0.059296 5.3059
## JobRoleManager 2.028e-01 4.932e+00 0.015674 2.6229
## JobRoleManufacturing Director 4.623e-01 2.163e+00 0.050164 4.2610
## JobRoleResearch Director 6.238e-02 1.603e+01 0.002715 1.4335
## JobRoleResearch Scientist 2.072e-01 4.827e+00 0.021417 2.0039
## JobRoleSales Representative 1.059e+00 9.442e-01 0.398632 2.8138
## JobSatisfaction 6.776e-01 1.476e+00 0.587532 0.7814
## MaritalStatusMarried 9.400e-01 1.064e+00 0.569430 1.5516
## MaritalStatusSingle 1.002e+00 9.978e-01 0.502801 1.9975
## MonthlyIncome 1.000e+00 1.000e+00 0.999796 1.0001
## MonthlyRate 1.000e+00 1.000e+00 1.000002 1.0000
## NumCompaniesWorked 1.276e+00 7.837e-01 1.195010 1.3624
## OverTimeYes 3.789e+00 2.639e-01 2.751392 5.2178
## PercentSalaryHike 1.014e+00 9.862e-01 0.946161 1.0867
## PerformanceRating 1.115e+00 8.971e-01 0.565484 2.1974
## RelationshipSatisfaction 8.282e-01 1.208e+00 0.713656 0.9610
## StockOptionLevel1 3.525e-01 2.837e+00 0.202677 0.6130
## StockOptionLevel2 2.909e-01 3.437e+00 0.126820 0.6673
## StockOptionLevel3 5.113e-01 1.956e+00 0.236604 1.1049
## TotalWorkingYears 8.719e-01 1.147e+00 0.814867 0.9328
## TrainingTimesLastYear 7.991e-01 1.251e+00 0.696675 0.9166
## WorkLifeBalance 8.606e-01 1.162e+00 0.696017 1.0640
## YearsInCurrentRole 6.694e-01 1.494e+00 0.614253 0.7295
## YearsSinceLastPromotion 1.040e+00 9.611e-01 0.961579 1.1257
## YearsWithCurrManager 6.910e-01 1.447e+00 0.637107 0.7494
##
## Concordance= 0.958 (se = 0.006 )
## Likelihood ratio test= 739.7 on 51 df, p=<2e-16
## Wald test = 399.9 on 51 df, p=<2e-16
## Score (logrank) test = 615.9 on 51 df, p=<2e-16
Looking at the first column (coef) we see rather large coefficents for some Department and JobRole variables. If we look back at the graphs of these variables, we see that their respective curves, cross. This is likely indicative of a key violation of the proportional hazards assumption.
Our next step then will be to try a more limited model which excludes those variables.
cox.model.simple <- coxph(formula = surv.obj ~ Age + BusinessTravel + DailyRate + DistanceFromHome + Education + EducationField + EnvironmentSatisfaction + Gender + HourlyRate + JobInvolvement + JobLevel +
JobSatisfaction + MaritalStatus + MonthlyIncome + MonthlyRate + NumCompaniesWorked + OverTime +
PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + StockOptionLevel + TotalWorkingYears +
TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager, data = train)
summary(cox.model.simple)
## Call:
## coxph(formula = surv.obj ~ Age + BusinessTravel + DailyRate +
## DistanceFromHome + Education + EducationField + EnvironmentSatisfaction +
## Gender + HourlyRate + JobInvolvement + JobLevel + JobSatisfaction +
## MaritalStatus + MonthlyIncome + MonthlyRate + NumCompaniesWorked +
## OverTime + PercentSalaryHike + PerformanceRating + RelationshipSatisfaction +
## StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
## WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion +
## YearsWithCurrManager, data = train)
##
## n= 1176, number of events= 193
##
## coef exp(coef) se(coef) z
## Age -2.273e-02 9.775e-01 1.263e-02 -1.800
## BusinessTravelTravel_Frequently 1.497e+00 4.470e+00 3.758e-01 3.985
## BusinessTravelTravel_Rarely 8.676e-01 2.381e+00 3.561e-01 2.437
## DailyRate -5.566e-05 9.999e-01 1.924e-04 -0.289
## DistanceFromHome 2.059e-02 1.021e+00 9.415e-03 2.187
## Education2 4.378e-01 1.549e+00 2.907e-01 1.506
## Education3 4.654e-01 1.593e+00 2.549e-01 1.826
## Education4 2.674e-01 1.307e+00 2.779e-01 0.962
## Education5 1.771e-01 1.194e+00 5.985e-01 0.296
## EducationFieldLife Sciences 1.625e-02 1.016e+00 5.162e-01 0.031
## EducationFieldMarketing 9.482e-01 2.581e+00 5.435e-01 1.745
## EducationFieldMedical -2.042e-01 8.153e-01 5.178e-01 -0.394
## EducationFieldOther 2.537e-01 1.289e+00 6.324e-01 0.401
## EducationFieldTechnical Degree 3.796e-01 1.462e+00 5.437e-01 0.698
## EnvironmentSatisfaction -3.168e-01 7.285e-01 6.911e-02 -4.584
## GenderMale 3.683e-01 1.445e+00 1.619e-01 2.274
## HourlyRate 4.034e-03 1.004e+00 4.122e-03 0.979
## JobInvolvement -3.068e-01 7.358e-01 1.044e-01 -2.939
## JobLevel2 -6.778e-01 5.078e-01 3.010e-01 -2.251
## JobLevel3 4.284e-01 1.535e+00 6.041e-01 0.709
## JobLevel4 -2.032e-01 8.161e-01 1.162e+00 -0.175
## JobLevel5 1.187e+00 3.277e+00 1.450e+00 0.818
## JobSatisfaction -3.758e-01 6.867e-01 7.189e-02 -5.228
## MaritalStatusMarried -1.373e-01 8.717e-01 2.512e-01 -0.547
## MaritalStatusSingle 6.423e-02 1.066e+00 3.414e-01 0.188
## MonthlyIncome -9.862e-05 9.999e-01 7.810e-05 -1.263
## MonthlyRate 2.365e-05 1.000e+00 1.109e-05 2.133
## NumCompaniesWorked 2.280e-01 1.256e+00 3.216e-02 7.090
## OverTimeYes 1.186e+00 3.274e+00 1.563e-01 7.586
## PercentSalaryHike 1.469e-02 1.015e+00 3.476e-02 0.422
## PerformanceRating 1.255e-02 1.013e+00 3.449e-01 0.036
## RelationshipSatisfaction -1.803e-01 8.350e-01 7.425e-02 -2.428
## StockOptionLevel1 -9.984e-01 3.685e-01 2.776e-01 -3.597
## StockOptionLevel2 -1.180e+00 3.073e-01 4.115e-01 -2.867
## StockOptionLevel3 -6.080e-01 5.444e-01 3.888e-01 -1.564
## TotalWorkingYears -1.516e-01 8.594e-01 3.422e-02 -4.428
## TrainingTimesLastYear -1.775e-01 8.374e-01 6.717e-02 -2.642
## WorkLifeBalance -1.828e-01 8.330e-01 1.049e-01 -1.742
## YearsInCurrentRole -3.718e-01 6.895e-01 4.246e-02 -8.757
## YearsSinceLastPromotion 1.509e-02 1.015e+00 3.828e-02 0.394
## YearsWithCurrManager -3.511e-01 7.039e-01 4.023e-02 -8.727
## Pr(>|z|)
## Age 0.071866 .
## BusinessTravelTravel_Frequently 6.75e-05 ***
## BusinessTravelTravel_Rarely 0.014820 *
## DailyRate 0.772307
## DistanceFromHome 0.028731 *
## Education2 0.132090
## Education3 0.067878 .
## Education4 0.336038
## Education5 0.767316
## EducationFieldLife Sciences 0.974879
## EducationFieldMarketing 0.081061 .
## EducationFieldMedical 0.693239
## EducationFieldOther 0.688243
## EducationFieldTechnical Degree 0.485109
## EnvironmentSatisfaction 4.57e-06 ***
## GenderMale 0.022943 *
## HourlyRate 0.327775
## JobInvolvement 0.003291 **
## JobLevel2 0.024356 *
## JobLevel3 0.478244
## JobLevel4 0.861202
## JobLevel5 0.413176
## JobSatisfaction 1.72e-07 ***
## MaritalStatusMarried 0.584693
## MaritalStatusSingle 0.850767
## MonthlyIncome 0.206706
## MonthlyRate 0.032943 *
## NumCompaniesWorked 1.34e-12 ***
## OverTimeYes 3.30e-14 ***
## PercentSalaryHike 0.672709
## PerformanceRating 0.970964
## RelationshipSatisfaction 0.015193 *
## StockOptionLevel1 0.000322 ***
## StockOptionLevel2 0.004141 **
## StockOptionLevel3 0.117829
## TotalWorkingYears 9.50e-06 ***
## TrainingTimesLastYear 0.008235 **
## WorkLifeBalance 0.081557 .
## YearsInCurrentRole < 2e-16 ***
## YearsSinceLastPromotion 0.693358
## YearsWithCurrManager < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## Age 0.9775 1.0230 0.95363 1.0020
## BusinessTravelTravel_Frequently 4.4702 0.2237 2.14022 9.3366
## BusinessTravelTravel_Rarely 2.3813 0.4199 1.18503 4.7851
## DailyRate 0.9999 1.0001 0.99957 1.0003
## DistanceFromHome 1.0208 0.9796 1.00214 1.0398
## Education2 1.5492 0.6455 0.87635 2.7388
## Education3 1.5927 0.6279 0.96638 2.6249
## Education4 1.3065 0.7654 0.75779 2.2526
## Education5 1.1937 0.8377 0.36938 3.8578
## EducationFieldLife Sciences 1.0164 0.9839 0.36956 2.7953
## EducationFieldMarketing 2.5811 0.3874 0.88952 7.4893
## EducationFieldMedical 0.8153 1.2266 0.29551 2.2492
## EducationFieldOther 1.2888 0.7759 0.37316 4.4514
## EducationFieldTechnical Degree 1.4617 0.6842 0.50354 4.2428
## EnvironmentSatisfaction 0.7285 1.3727 0.63621 0.8342
## GenderMale 1.4453 0.6919 1.05223 1.9851
## HourlyRate 1.0040 0.9960 0.99596 1.0122
## JobInvolvement 0.7358 1.3591 0.59966 0.9028
## JobLevel2 0.5078 1.9695 0.28146 0.9160
## JobLevel3 1.5348 0.6516 0.46973 5.0146
## JobLevel4 0.8161 1.2253 0.08365 7.9619
## JobLevel5 3.2768 0.3052 0.19094 56.2343
## JobSatisfaction 0.6867 1.4562 0.59647 0.7906
## MaritalStatusMarried 0.8717 1.1472 0.53274 1.4263
## MaritalStatusSingle 1.0663 0.9378 0.54611 2.0822
## MonthlyIncome 0.9999 1.0001 0.99975 1.0001
## MonthlyRate 1.0000 1.0000 1.00000 1.0000
## NumCompaniesWorked 1.2561 0.7961 1.17939 1.3379
## OverTimeYes 3.2741 0.3054 2.40997 4.4481
## PercentSalaryHike 1.0148 0.9854 0.94795 1.0863
## PerformanceRating 1.0126 0.9875 0.51508 1.9908
## RelationshipSatisfaction 0.8350 1.1975 0.72195 0.9659
## StockOptionLevel1 0.3685 2.7139 0.21385 0.6349
## StockOptionLevel2 0.3073 3.2536 0.13721 0.6885
## StockOptionLevel3 0.5444 1.8368 0.25410 1.1664
## TotalWorkingYears 0.8594 1.1636 0.80361 0.9190
## TrainingTimesLastYear 0.8374 1.1942 0.73409 0.9552
## WorkLifeBalance 0.8330 1.2005 0.67814 1.0232
## YearsInCurrentRole 0.6895 1.4504 0.63443 0.7493
## YearsSinceLastPromotion 1.0152 0.9850 0.94183 1.0943
## YearsWithCurrManager 0.7039 1.4206 0.65053 0.7617
##
## Concordance= 0.954 (se = 0.006 )
## Likelihood ratio test= 705.3 on 41 df, p=<2e-16
## Wald test = 438.6 on 41 df, p=<2e-16
## Score (logrank) test = 575.7 on 41 df, p=<2e-16
Here our coeffiecents are not too large to give us cause for concern. Our next step is to evaluate those model assumption tests of our data.
The first we’ll look at is the proportional hazards (PH) assumption. Since the Cox model evaluates events over time, if certain variables behave differently at different time points (i.e., age has one kind of effect on attrition when age > 35 but different effects when age < 35) that would violate a key model assumption.The proportional hazards assumption can be checked using statistical tests and graphical diagnostics based on the scaled Schoenfeld residuals. To conduct this test, we’ll use the cox.zph() formula.
For each covariate, the function cox.zph() correlates the corresponding set of scaled Schoenfeld residuals with time, to test for independence between residuals and time. Additionally, it performs a global test for the model as a whole. Ideally, we want all of the corresponding values to be non-significant.
cox.zph(cox.model.simple)
## chisq df p
## Age 0.0273 1 0.8689
## BusinessTravel 12.9727 2 0.0015
## DailyRate 3.8145 1 0.0508
## DistanceFromHome 0.0263 1 0.8712
## Education 3.2334 4 0.5196
## EducationField 28.9422 5 2.4e-05
## EnvironmentSatisfaction 0.5387 1 0.4630
## Gender 2.1770 1 0.1401
## HourlyRate 0.7451 1 0.3880
## JobInvolvement 0.9012 1 0.3424
## JobLevel 11.8422 4 0.0186
## JobSatisfaction 0.4432 1 0.5056
## MaritalStatus 2.5961 2 0.2731
## MonthlyIncome 0.6448 1 0.4220
## MonthlyRate 0.1186 1 0.7306
## NumCompaniesWorked 3.5422 1 0.0598
## OverTime 0.6879 1 0.4069
## PercentSalaryHike 0.4551 1 0.4999
## PerformanceRating 0.0354 1 0.8507
## RelationshipSatisfaction 0.6910 1 0.4058
## StockOptionLevel 5.3265 3 0.1494
## TotalWorkingYears 1.3649 1 0.2427
## TrainingTimesLastYear 0.9031 1 0.3420
## WorkLifeBalance 0.8317 1 0.3618
## YearsInCurrentRole 82.9276 1 < 2e-16
## YearsSinceLastPromotion 0.6691 1 0.4134
## YearsWithCurrManager 21.3374 1 3.9e-06
## GLOBAL 181.5072 41 < 2e-16
The global score for our model should be non-significant (which is very much not the case here) to indicate we did not violate our test assumptions. We can go ahead and see which individual variables are also driving the GLOBAL stat.
Based on the model assumption test, we have a new model that excludes the variables that didn’t pass the test.
cox.model.specified <- coxph(formula = surv.obj ~ Age + DailyRate + DistanceFromHome + Education + EnvironmentSatisfaction + HourlyRate + JobInvolvement + JobLevel + JobSatisfaction + MaritalStatus + MonthlyIncome + MonthlyRate + OverTime +
PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance + YearsSinceLastPromotion, data = train)
cox.zph(cox.model.specified)
## chisq df p
## Age 1.04573 1 0.30649
## DailyRate 2.21677 1 0.13652
## DistanceFromHome 0.00386 1 0.95049
## Education 3.91352 4 0.41784
## EnvironmentSatisfaction 0.91316 1 0.33928
## HourlyRate 0.08422 1 0.77166
## JobInvolvement 0.60139 1 0.43805
## JobLevel 13.33479 4 0.00975
## JobSatisfaction 1.11252 1 0.29153
## MaritalStatus 4.00768 2 0.13482
## MonthlyIncome 1.21766 1 0.26982
## MonthlyRate 0.37310 1 0.54132
## OverTime 1.46734 1 0.22577
## PercentSalaryHike 0.01284 1 0.90978
## PerformanceRating 0.06648 1 0.79654
## RelationshipSatisfaction 0.60586 1 0.43635
## StockOptionLevel 5.26629 3 0.15330
## TotalWorkingYears 0.12557 1 0.72307
## TrainingTimesLastYear 0.89243 1 0.34482
## WorkLifeBalance 0.38761 1 0.53356
## YearsSinceLastPromotion 11.17136 1 0.00083
## GLOBAL 49.87591 30 0.01278
So our model is still not passing the PH test, but we’re doing better. Here we only need to remove JobLevel and YearsSinceLastPromotion.
cox.model.specified <- coxph(formula = surv.obj ~ DailyRate + DistanceFromHome + Education + EnvironmentSatisfaction + HourlyRate + JobInvolvement + JobSatisfaction + MonthlyIncome + MonthlyRate + OverTime +
PercentSalaryHike + PerformanceRating + RelationshipSatisfaction + StockOptionLevel + TotalWorkingYears +
TrainingTimesLastYear + WorkLifeBalance + Age, data = train)
cox.zph(cox.model.specified)
## chisq df p
## DailyRate 1.1148 1 0.29
## DistanceFromHome 0.1641 1 0.69
## Education 2.9774 4 0.56
## EnvironmentSatisfaction 0.6634 1 0.42
## HourlyRate 0.1239 1 0.72
## JobInvolvement 0.8622 1 0.35
## JobSatisfaction 0.6491 1 0.42
## MonthlyIncome 1.5187 1 0.22
## MonthlyRate 0.9599 1 0.33
## OverTime 1.6071 1 0.20
## PercentSalaryHike 0.2847 1 0.59
## PerformanceRating 0.0422 1 0.84
## RelationshipSatisfaction 0.3228 1 0.57
## StockOptionLevel 4.1197 3 0.25
## TotalWorkingYears 0.4274 1 0.51
## TrainingTimesLastYear 0.9655 1 0.33
## WorkLifeBalance 0.3766 1 0.54
## Age 1.6740 1 0.20
## GLOBAL 18.8075 23 0.71
Success! Our model is now passing the PH test. It should be noted that just because a variable doesn’t pass this PH test it has to be excluded - we can multiply the variable by an interaction with time, we could change the “form” of the variable (i.e., take a continuous variable and convert to bins/factors) and eliminate extreme values.
Now let’s examine the summary output in more detail for our final model.
summary(cox.model.specified)
## Call:
## coxph(formula = surv.obj ~ DailyRate + DistanceFromHome + Education +
## EnvironmentSatisfaction + HourlyRate + JobInvolvement + JobSatisfaction +
## MonthlyIncome + MonthlyRate + OverTime + PercentSalaryHike +
## PerformanceRating + RelationshipSatisfaction + StockOptionLevel +
## TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance +
## Age, data = train)
##
## n= 1176, number of events= 193
##
## coef exp(coef) se(coef) z Pr(>|z|)
## DailyRate -1.603e-04 9.998e-01 1.864e-04 -0.860 0.389684
## DistanceFromHome 2.680e-02 1.027e+00 8.716e-03 3.074 0.002109 **
## Education2 5.443e-02 1.056e+00 2.793e-01 0.195 0.845513
## Education3 3.503e-01 1.420e+00 2.377e-01 1.474 0.140603
## Education4 1.788e-01 1.196e+00 2.652e-01 0.674 0.500241
## Education5 -3.881e-01 6.784e-01 5.727e-01 -0.678 0.498008
## EnvironmentSatisfaction -2.365e-01 7.894e-01 6.445e-02 -3.669 0.000244 ***
## HourlyRate -2.729e-03 9.973e-01 3.811e-03 -0.716 0.473975
## JobInvolvement -2.921e-01 7.467e-01 9.968e-02 -2.930 0.003389 **
## JobSatisfaction -3.262e-01 7.217e-01 6.708e-02 -4.863 1.16e-06 ***
## MonthlyIncome -6.692e-05 9.999e-01 3.503e-05 -1.911 0.056067 .
## MonthlyRate 1.068e-05 1.000e+00 1.019e-05 1.047 0.294901
## OverTimeYes 1.278e+00 3.590e+00 1.497e-01 8.537 < 2e-16 ***
## PercentSalaryHike 3.502e-03 1.004e+00 3.195e-02 0.110 0.912731
## PerformanceRating 8.860e-02 1.093e+00 3.244e-01 0.273 0.784769
## RelationshipSatisfaction -9.085e-02 9.132e-01 6.727e-02 -1.351 0.176826
## StockOptionLevel1 -9.119e-01 4.018e-01 1.749e-01 -5.215 1.84e-07 ***
## StockOptionLevel2 -1.106e+00 3.309e-01 3.328e-01 -3.323 0.000891 ***
## StockOptionLevel3 -4.341e-01 6.479e-01 3.200e-01 -1.357 0.174927
## TotalWorkingYears -2.854e-01 7.517e-01 3.115e-02 -9.160 < 2e-16 ***
## TrainingTimesLastYear -1.100e-01 8.959e-01 6.237e-02 -1.763 0.077905 .
## WorkLifeBalance -1.434e-01 8.664e-01 1.016e-01 -1.411 0.158198
## Age 8.193e-03 1.008e+00 1.144e-02 0.716 0.473961
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## DailyRate 0.9998 1.0002 0.9995 1.0002
## DistanceFromHome 1.0272 0.9736 1.0098 1.0449
## Education2 1.0559 0.9470 0.6108 1.8256
## Education3 1.4195 0.7045 0.8908 2.2621
## Education4 1.1957 0.8363 0.7111 2.0108
## Education5 0.6784 1.4742 0.2208 2.0843
## EnvironmentSatisfaction 0.7894 1.2668 0.6957 0.8957
## HourlyRate 0.9973 1.0027 0.9899 1.0048
## JobInvolvement 0.7467 1.3392 0.6142 0.9078
## JobSatisfaction 0.7217 1.3857 0.6327 0.8231
## MonthlyIncome 0.9999 1.0001 0.9999 1.0000
## MonthlyRate 1.0000 1.0000 1.0000 1.0000
## OverTimeYes 3.5904 0.2785 2.6773 4.8150
## PercentSalaryHike 1.0035 0.9965 0.9426 1.0684
## PerformanceRating 1.0926 0.9152 0.5785 2.0636
## RelationshipSatisfaction 0.9132 1.0951 0.8004 1.0418
## StockOptionLevel1 0.4018 2.4890 0.2852 0.5660
## StockOptionLevel2 0.3309 3.0220 0.1723 0.6353
## StockOptionLevel3 0.6479 1.5436 0.3460 1.2130
## TotalWorkingYears 0.7517 1.3303 0.7072 0.7991
## TrainingTimesLastYear 0.8959 1.1162 0.7928 1.0124
## WorkLifeBalance 0.8664 1.1542 0.7100 1.0573
## Age 1.0082 0.9918 0.9859 1.0311
##
## Concordance= 0.877 (se = 0.015 )
## Likelihood ratio test= 433 on 23 df, p=<2e-16
## Wald test = 308.2 on 23 df, p=<2e-16
## Score (logrank) test = 344.4 on 23 df, p=<2e-16
We can now look at a few key takeaways from the model. At the bottom of the output we have 3 tests for the global significance: Liklihood ratio tests, Wald test, and Logrank score. In our model, all three global sig tests would lead us to conclude that the model we have derived is significant (p < .00). The next step is then to determine which variables from our model are significantly predicting attrition. The column labled Pr(>|z|) indicates p-values at the individual variable level. To the right of the column, we can easily see which variables are significant.
The next step is determining how the variables are influencing towards our dependent variable (in our case, attrition). Our DV is coded as 0 = No attrit, 1 = Attrit. We can then look at the column labled coef and determine that if the number is positive, it is influencing towards Attrition while a negative value is influencing towards retention. We can then look at the column directly to the right, exp(coef), to calculate the hazards ratio (HR). The hazard ratios of covariates are interpretable as multiplicative effects on the hazard,holding the other covariates constant. The simple way to calculate/interpret the HR is to subtract the number in the exp(coef) colunmn from 1 and convert to a percentage. For example, JobSatisfaction exp(coef)= 0.72. We can then say that each unit increase of JobSatisfaction (1-5, higher scores indicate higher job satisfaction) has a 28% (1 - 0.72 = .28) of increasing retention. Similarly, we can look at OverTimeYes, exp(coef) = 3.644, subtract 1 from 3.644 (for ease of interpretation) and we get 2.64. After conversion to a percentage, we can estimate that, holding all other variables equal, being an overtime employee is associated with a 264% increased risk of attrition!
Before we get too caught up in our model interpretation, we should continue testing key assumptions. We already have the PH assumption out of the way so let’s explore others.
We can test the impact of influential observations with the code:
ggcoxdiagnostics(cox.model.specified, type = "dfbeta",
linear.predictions = FALSE, ggtheme = theme_bw())
## `geom_smooth()` using formula 'y ~ x'
and we can inspect residuals with:
ggcoxdiagnostics(cox.model.specified, type = "deviance",
linear.predictions = FALSE, ggtheme = theme_bw())
## `geom_smooth()` using formula 'y ~ x'
The above residuals should give us pause and stimulate discussion about why we might be observing this particular pattern.
We can also test for linearity of our continuous variables. Plotting the Martingale residuals against continuous covariates is a common approach used to detect nonlinearity or, in other words, to assess the functional form of a covariate (does our variable violate a key assumption of the test and therefore warrant some type of transformation). The function ggcoxfunctional() displays graphs of continuous covariates against martingale residuals of null cox proportional hazards model. Essentially, we want the fitted lowess lines to demonstrate a linear relation with the plotted data.
ggcoxfunctional(Surv(YearsAtCompany, Attrition) ~ TotalWorkingYears + DistanceFromHome + EnvironmentSatisfaction + JobInvolvement + JobSatisfaction + MonthlyIncome, data = train)
## Warning: arguments formula is deprecated; will be removed in the next version;
## please use fit instead.
In this case we might see some cause for concern in EnvironmentSatisfaction, DistanceFromHome, and MonthlyIncome. In some senses, this can make sense. Self-report satisfaction surveys tend to skew positive. Why don’t we try the same assumption test on a log transformed version of MonthlyIncome, to try and account for skew.
ggcoxfunctional(Surv(YearsAtCompany, Attrition) ~ log(MonthlyIncome), data = train)
## Warning: arguments formula is deprecated; will be removed in the next version;
## please use fit instead.
The following code essentially takes the Cox model we built earlier from the 80% split of data (train) and sees how well it applies to the 20% data we reserved for testing (test). The corresponding number we pull out is the area-under-the-curve (AUC). This number exists on a 0 to 1 range and essentially tells us how well our model predicts the test data. AUC scores above .5 indicate that the model is predicting 0’s when it should predict 0’s and 1’s when it should predict 1’s. Anything under .5 indicates that our model is predicting 0’s when it should predict 1’s and vice versa.
cox.pred.full <- predict(cox.model.specified, newdata = test, type = "lp")
roc.obj <- survivalROC::survivalROC(Stime = test$YearsAtCompany,
status = test$Attrition,
marker = cox.pred.full,
predict.time = 1,
lambda = 0.003)
roc.obj$AUC
## [1] 0.7994259
While an AUC score of .8 isn’t bad, let’s look at what happens when we use our earlier, full model.
cox.pred.full <- predict(cox.model.full, newdata = test, type = "lp")
roc.obj <- survivalROC::survivalROC(Stime = test$YearsAtCompany,
status = test$Attrition,
marker = cox.pred.full,
predict.time = 1,
lambda = 0.003)
roc.obj$AUC
## [1] 0.9701103
So essentially our model prediction cabability shot up to nearly perfect.
We can continue to apply these predictive features to our data to now guide decision making.
# Inputting risk estimates into our data
train$lp <- predict(cox.model.specified, type = "lp")
train$Risk <- predict(cox.model.specified, type = "risk")
train$Test <- predict(cox.model.full, type = "risk")
train$Exp <- predict(cox.model.specified, type = "expected")
test$Risk <- predict(cox.model.specified, newdata = test, type = "risk")
glimpse(train %>%
select(Attrition, lp, Risk, Test, Exp) %>% arrange(desc(Risk)))
## Rows: 1,176
## Columns: 5
## $ Attrition <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, ~
## $ lp <dbl> 5.572197, 5.528736, 5.366441, 5.270501, 5.215228, 5.071213, ~
## $ Risk <dbl> 263.01136, 251.82544, 214.09960, 194.51342, 184.05375, 159.3~
## $ Test <dbl> 5047.3302, 460.7552, 2568.8619, 1266.7733, 782.6617, 2023.72~
## $ Exp <dbl> 1.38017879, 1.32147952, 1.12350935, 1.02072892, 2.15397569, ~
glimpse(train %>%
select(Attrition, lp, Risk, Test, Exp) %>% arrange(-desc(Risk)))
## Rows: 1,176
## Columns: 5
## $ Attrition <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ~
## $ lp <dbl> -9.619404, -8.928902, -8.797936, -8.508254, -8.408845, -8.39~
## $ Risk <dbl> 0.0000664272, 0.0001325034, 0.0001510445, 0.0002017958, 0.00~
## $ Test <dbl> 4.315366e-05, 1.296558e-04, 3.346174e-04, 2.825509e-04, 6.87~
## $ Exp <dbl> 1.012227e+00, 1.921798e-05, 2.780320e-02, 3.714513e-02, 4.10~
glimpse(train %>%
select(Attrition, lp, Risk, Test, Exp) %>% arrange(desc(Test)))
## Rows: 1,176
## Columns: 5
## $ Attrition <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ lp <dbl> 4.832234, 5.572197, 4.759101, 5.366441, 4.807865, 3.397195, ~
## $ Risk <dbl> 125.491056, 263.011361, 116.640991, 214.099597, 122.469847, ~
## $ Test <dbl> 8387.3881, 5047.3302, 2571.6417, 2568.8619, 2567.4601, 2218.~
## $ Exp <dbl> 0.107292957, 1.380178791, 0.612085428, 1.123509348, 0.104709~
glimpse(train %>%
select(Attrition, lp, Risk, Test, Exp) %>% arrange(-desc(Test)))
## Rows: 1,176
## Columns: 5
## $ Attrition <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, ~
## $ lp <dbl> -5.848341, -4.736709, -7.019900, -6.110899, -7.952284, -7.22~
## $ Risk <dbl> 0.0028846808, 0.0087674549, 0.0008939145, 0.0022185562, 0.00~
## $ Test <dbl> 1.745942e-11, 1.266040e-10, 2.107029e-09, 2.585660e-09, 3.52~
## $ Exp <dbl> 5.309915e-01, 3.807959e-02, 9.224744e-02, 3.217740e-04, 2.03~
Ideally we should be approaching HR issues with some form of testable hypothesis. One way to accomplish that is testing different nested models.
For example, let’s say we had a hypothesis that Stock Option Level significantly influenced attrition. The following code takes a few variables that we possibly already know are meaningful predictors of attrition (fit), and compares that model to one that includes the addition of Stock Option (fit2).
The ANOVA evaluates whether the additional variable explains enough additional variance in the model to be significant.
# Comparing models with anova()
fit<- coxph(formula = surv.obj ~ BusinessTravel + Gender + JobLevel + JobRole, data = train)
fit2<- coxph(formula = surv.obj ~ BusinessTravel + Gender + JobLevel + JobRole + StockOptionLevel, data = train)
anova(fit, fit2)
## Analysis of Deviance Table
## Cox model: response is surv.obj
## Model 1: ~ BusinessTravel + Gender + JobLevel + JobRole
## Model 2: ~ BusinessTravel + Gender + JobLevel + JobRole + StockOptionLevel
## loglik Chisq Df P(>|Chi|)
## 1 -1136.6
## 2 -1115.4 42.428 3 3.255e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Significant differences here indicate that our second model contributes to a greater proportion of variance explained compared to the first model.