This project is a survival analysis approached towards the employee turnover according to many factors related. Survival analysis is a branch of statistics for analyzing the expected duration of time until a significant event happen. In this project, the event will be employees quitting

Data Preparations

Calling all required libraries

library(survival)
library(ranger)
library(ggplot2)
library(dplyr)
library(ggfortify)
library(readxl)
library(MASS)
library(ADGofTest)
library(survminer)
library(car)

Importing the dataset

This project will be using The Employee Turnover dataset.

df=read.csv("D:/WRITE IT RIGHT 2/employee-turnover-weibull-regression/turnover.csv", header = TRUE)
head(df)
##        stag event gender age        industry profession     traffic coach
## 1  7.030801     1      m  35           Banks         HR rabrecNErab    no
## 2 22.965092     1      m  33           Banks         HR       empjs    no
## 3 15.934292     1      f  35 PowerGeneration         HR rabrecNErab    no
## 4 15.934292     1      f  35 PowerGeneration         HR rabrecNErab    no
## 5  8.410678     1      m  32          Retail Commercial       youjs   yes
## 6  8.969199     1      f  42     manufacture         HR       empjs   yes
##   head_gender greywage way extraversion independ selfcontrol anxiety novator
## 1           f    white bus          6.2      4.1         5.7     7.1     8.3
## 2           m    white bus          6.2      4.1         5.7     7.1     8.3
## 3           m    white bus          6.2      6.2         2.6     4.8     8.3
## 4           m    white bus          5.4      7.6         4.9     2.5     6.7
## 5           f    white bus          3.0      4.1         8.0     7.1     3.7
## 6           m    white bus          6.2      6.2         4.1     5.6     6.7

The data has 16 columns (variables) and 1067 rows. The variables explanation is written down below:

  • stag = Experience (time)
  • event = Employee turnover: turnover(1) or survive(0)
  • gender = Employee’s gender: female(f) or male(m)
  • age = Employee’s age (year)
  • industry = Employee’s Industry: Banks, PowerGeneration, Retail, manufacture, Consult, State, Building, IT, HoReCa, Telecom, Pharma, Mining, transport, Agriculture, RealEstate, etc.
  • profession = Employee’s profession: HR, Commercial, Marketing, Sales, BusinessDevelopment, Finance, Teaching, manage, IT, Law, Consult, Engineer, PR, Accounting, etc.
  • traffic = From what pipelene employee came to the company: rabrecNErab, empjs, youjs, referal, advert, KA, recNErab, and friends
  • coach = Presence of a coach (training) on probation: yes or no
  • head_gender = Head (supervisor) gender: female(f) or male(m)
  • greywage = The salary does not seem to the tax authorities
  • way = How an employee gets to workplace (by feet, by bus etc)
  • Big5 test scales, such as:
    • extraversion
    • independ
    • selfcontrol
    • anxiety
    • novator

Exploratory Data Analysis

In exploratory data analysis, it is crucial to run tests to determine whether the distribution of the time variable follows a particular distribution before modeling the data. The Weibull Distribution is the most common distribution used in survival analysis, while the time variable follows it. Thus, we will check on stag as the time variable to see if it has the Weibull distribution.

As a comparator of p-values in this project, we will use the \(\alpha = 0.05\)

Distribution testing

fitdistr(df$stag, "weibull")
##       shape         scale   
##    1.08272153   37.78782884 
##  ( 0.02503311) ( 1.09608991)

According to the result in the previous code, we obtained the Weibull Distribution parameters are:

  • shape = 1.08272153
  • scale = 37.78782884

Therefore, we can continue to test the time variable using Anderson-Darling Goodness of Fit test using required parameters.

ad.test(df$stag, pweibull, shape= 1.08272153, scale=37.78782884)
## 
##  Anderson-Darling GoF Test
## 
## data:  df$stag  and  pweibull
## AD = 1.3391, p-value = 0.22
## alternative hypothesis: NA

We obtained the p-value of 0.22 > 0.05, therefore we can conclude that the time variable is following The Weibull Distribution.

Kaplan-Meier Estimation and Log-Rank Test for The Categorical Variables

There are gender, industry, profression, traffic, coach, head_gender, greywage, and way as the categorical variables. The Kaplan-Meier Estimation Curve in survival analysis allows us to evaluate how each category of survival probability differs from the others. To determine the difference between each category based on the curve we will receive, we will thus use The Kaplan Meier Estimation. We will run the Log-Rank test to ensure if there is a difference between the categories of the variables.

Gender

By running the following code, we will first perform the Kaplan-Meier estimation according to the gender.

km_gender<-survfit(Surv(stag, event)~gender, data = df, type="kaplan-meier")

When the code above is run, the code below can be used to display the Kaplan-Meier estimation curve.

ggsurvplot(km_gender, data=df,
           conf.int = FALSE,
           ggtheme = theme_minimal(),
           legend.labs = c("female", "male"),
           pval = TRUE,
           pval.method = TRUE)+
  ggtitle("Survival curve based on Gender")

The Kaplan-Meier curve shown above shows that female workers have a higher survival probability than male workers.

Since we achieved the p-value of 0.13 > 0.05, we can say that there is no significant distinction between the survival probabilities of each gender.

Industry

By running the following code, we will first perform the Kaplan-Meier estimation according to the industry.

km_industry<-survfit(Surv(stag, event)~industry, data = df, type ="kaplan-meier")

When the code above is run, the code below can be used to display the Kaplan-Meier estimation curve.

ggsurvplot(km_industry, data=df,
           conf.int = FALSE,
           ggtheme = theme_minimal(),
           pval = TRUE,
           pval.method = TRUE)+
  ggtitle("Survival curve based on Industry")

The Kaplan-Meier curve shown above demonstrates that employees in the retail industry have the highest chance of surviving. Workers in the agricultural industry, however, have the lowest chances of surviving.

Since we achieved the p-value < 0.05, we can say that there is significant distinction between the survival probabilities of each industry.

Profession

By running the following code, we will first perform the Kaplan-Meier estimation according to the profession.

km_profession<-survfit(Surv(stag, event)~profession, data = df, type="kaplan-meier")

When the code above is run, the code below can be used to display the Kaplan-Meier estimation curve.

ggsurvplot(km_profession, data=df,
           conf.int = FALSE,
           ggtheme = theme_minimal(),
           pval = TRUE,
           pval.method = TRUE)+
  ggtitle("Survival curve based on Profession")

The Kaplan-Meier curve shown above shows that employees in the HR profession have the highest chance of surviving. Workers in the law profession, on the other hand, have the lowest chances of surviving.

Since we achieved the p-value of 0.0087 < 0.05, we can say that there is significant distinction between the survival probabilities of each profession.

Traffic

By running the following code, we will first perform the Kaplan-Meier estimation according to the traffic.

km_traffic<-survfit(Surv(stag, event)~traffic, data= df, type="kaplan-meier")

When the code above is run, the code below can be used to display the Kaplan-Meier estimation curve.

ggsurvplot(km_traffic, data=df,
           conf.int = FALSE,
           ggtheme = theme_minimal(),
           pval = TRUE,
           pval.method = TRUE)+
  ggtitle("Survival curve based on Traffic")

Workers from the ‘traffic’ of KA have the highest probability of surviving, as shown by the Kaplan-Meier curve above. Workers from the “traffic” of friends, on the other hand, have the lowest chance of surviving.

Since we achieved the p-value of 0.001 < 0.05, we can say that there is significant distinction between the survival probabilities of each category of traffic.

Coach

By running the following code, we will first perform the Kaplan-Meier estimation according to the coach.

km_coach<-survfit(Surv(stag, event)~coach, data = df, type="kaplan-meier")

When the code above is run, the code below can be used to display the Kaplan-Meier estimation curve.

ggsurvplot(km_coach, data=df,
           conf.int = FALSE,
           ggtheme = theme_minimal(),
           pval = TRUE,
           pval.method = TRUE)+
  ggtitle("Survival curve based on Coach")

The workers from the ‘coach’ of no have the highest chance of surviving, as shown by the Kaplan-Meier curve above. Workers from the ‘coach’ of yes, on the other hand, have the lowest chance of surviving.

Since we achieved the p-value of 0.19 > 0.05, we can say that there is no significant distinction between the survival probabilities of each category of coach.

Head Gender

By running the following code, we will first perform the Kaplan-Meier estimation according to the head_gender.

km_headgender<-survfit(Surv(stag, event)~head_gender, data= df, type="kaplan-meier")

When the code above is run, the code below can be used to display the Kaplan-Meier estimation curve.

ggsurvplot(km_headgender, data=df,
           conf.int = FALSE,
           ggtheme = theme_minimal(),
           legend.labs = c("female", "male"),
           pval = TRUE,
           pval.method = TRUE)+
  ggtitle("Survival curve based on Head Gender")

We can observe from the Kaplan-Meier estimator that workers with a female head_gender have a higher chance of surviving than those with a male head_gender.

Since we achieved the p-value of 0.27 > 0.05, we can say that there is no significant distinction between the survival probabilities of head_gender.

Greywage

By running the following code, we will first perform the Kaplan-Meier estimation according to the greywage.

km_greywage<-survfit(Surv(stag, event)~greywage, data = df, type ="kaplan-meier")

When the code above is run, the code below can be used to display the Kaplan-Meier estimation curve.

ggsurvplot(km_greywage, data=df,
           conf.int = FALSE,
           ggtheme = theme_minimal(),
           legend.labs = c("grey", "white"),
           pval = TRUE,
           pval.method = TRUE)+
  ggtitle("Survival curve based on Greywage")

The Kaplan-Meier estimate shown above shows that workers with the ‘greywage’ of white have a higher likelihood of surviving than those with the ‘greywage’ of grey.

Since we achieved the p-value of 0.0001 < 0.05, we can say that there is significant distinction between the survival probabilities of greywage.

Way

By running the following code, we will first perform the Kaplan-Meier estimation according to the way.

km_way<-survfit(Surv(stag, event)~way,
                data= df,
                type="kaplan-meier")

When the code above is run, the code below can be used to display the Kaplan-Meier estimation curve.

ggsurvplot(km_way, data=df,
           conf.int = FALSE,
           ggtheme = theme_minimal(),
           legend.labs=c("bus", "car", "foot"),
           pval = TRUE,
           pval.method = TRUE)+
  ggtitle("Survival curve based on Commuters(way)")

Workers who take the bus to work have the best chance of surviving, as seen by the Kaplan-Meier curve above. On the other hand, those who commute to work by car have the lowest chance of surviving.

Since we achieved the p-value of 0.003 < 0.05, we can say that there is significant distinction between the survival probabilities of way.

Data Modelling

We will first use the Weibull regression model to create the main model for the data, which will incorporate all independent variables.

model0<-survreg(Surv(stag, event)~.,
                 data = df, dist = "weibull")

We can focus the model using the independent variable that has the greatest influence from the main model. It’s possible that the primary model we find didn’t accurately capture the factors that had the greatest impact on the employees’ survival. As a result, we will execute the code below, which will produce a list of all possible combinations of independent variables together with their AIC values. The best appropriate model that we can use will be indicated by the AIC number itself.

step(model0)
## Start:  AIC=5954.11
## Surv(stag, event) ~ gender + age + industry + profession + traffic + 
##     coach + head_gender + greywage + way + extraversion + independ + 
##     selfcontrol + anxiety + novator
## 
##                Df    AIC
## - novator       1 5952.2
## - extraversion  1 5952.2
## - coach         2 5952.3
## - independ      1 5952.4
## - head_gender   1 5952.6
## - gender        1 5952.6
## <none>            5954.1
## - selfcontrol   1 5954.6
## - anxiety       1 5955.1
## - profession   14 5956.2
## - way           2 5956.5
## - age           1 5962.0
## - greywage      1 5964.3
## - traffic       7 5978.6
## - industry     15 5980.3
## 
## Step:  AIC=5952.15
## Surv(stag, event) ~ gender + age + industry + profession + traffic + 
##     coach + head_gender + greywage + way + extraversion + independ + 
##     selfcontrol + anxiety
## 
##                Df    AIC
## - extraversion  1 5950.3
## - coach         2 5950.4
## - independ      1 5950.5
## - gender        1 5950.6
## - head_gender   1 5950.6
## <none>            5952.2
## - anxiety       1 5953.1
## - selfcontrol   1 5953.7
## - profession   14 5954.2
## - way           2 5954.5
## - age           1 5960.1
## - greywage      1 5962.3
## - traffic       7 5976.6
## - industry     15 5978.7
## 
## Step:  AIC=5950.25
## Surv(stag, event) ~ gender + age + industry + profession + traffic + 
##     coach + head_gender + greywage + way + independ + selfcontrol + 
##     anxiety
## 
##               Df    AIC
## - coach        2 5948.6
## - gender       1 5948.7
## - head_gender  1 5948.7
## - independ     1 5949.0
## <none>           5950.3
## - profession  14 5952.6
## - anxiety      1 5952.6
## - way          2 5952.6
## - selfcontrol  1 5956.1
## - age          1 5958.1
## - greywage     1 5960.3
## - traffic      7 5975.1
## - industry    15 5977.1
## 
## Step:  AIC=5948.55
## Surv(stag, event) ~ gender + age + industry + profession + traffic + 
##     head_gender + greywage + way + independ + selfcontrol + anxiety
## 
##               Df    AIC
## - gender       1 5947.0
## - head_gender  1 5947.0
## - independ     1 5947.4
## <none>           5948.6
## - way          2 5950.4
## - anxiety      1 5951.1
## - profession  14 5951.3
## - selfcontrol  1 5955.4
## - age          1 5956.8
## - greywage     1 5958.8
## - traffic      7 5973.2
## - industry    15 5977.0
## 
## Step:  AIC=5946.99
## Surv(stag, event) ~ age + industry + profession + traffic + head_gender + 
##     greywage + way + independ + selfcontrol + anxiety
## 
##               Df    AIC
## - head_gender  1 5945.4
## - independ     1 5945.9
## <none>           5947.0
## - way          2 5949.0
## - profession  14 5949.4
## - anxiety      1 5950.6
## - selfcontrol  1 5954.6
## - age          1 5955.3
## - greywage     1 5957.6
## - traffic      7 5971.7
## - industry    15 5975.7
## 
## Step:  AIC=5945.41
## Surv(stag, event) ~ age + industry + profession + traffic + greywage + 
##     way + independ + selfcontrol + anxiety
## 
##               Df    AIC
## - independ     1 5944.4
## <none>           5945.4
## - way          2 5947.4
## - profession  14 5948.4
## - anxiety      1 5949.3
## - selfcontrol  1 5953.3
## - age          1 5955.7
## - greywage     1 5956.0
## - traffic      7 5969.7
## - industry    15 5974.3
## 
## Step:  AIC=5944.43
## Surv(stag, event) ~ age + industry + profession + traffic + greywage + 
##     way + selfcontrol + anxiety
## 
##               Df    AIC
## <none>           5944.4
## - way          2 5946.5
## - profession  14 5947.0
## - anxiety      1 5947.3
## - selfcontrol  1 5951.3
## - age          1 5954.0
## - greywage     1 5955.4
## - traffic      7 5968.9
## - industry    15 5973.3
## Call:
## survreg(formula = Surv(stag, event) ~ age + industry + profession + 
##     traffic + greywage + way + selfcontrol + anxiety, data = df, 
##     dist = "weibull")
## 
## Coefficients:
##                   (Intercept)                           age 
##                    4.43131693                   -0.01854194 
##           industryAgriculture                 industryBanks 
##                   -0.73601255                   -0.42850204 
##              industryBuilding               industryConsult 
##                   -0.47008103                   -0.30034308 
##                   industryetc                    industryIT 
##                   -0.14455627                    0.36524531 
##           industrymanufacture                industryMining 
##                    0.06281340                   -0.10000811 
##                industryPharma       industryPowerGeneration 
##                    0.07751048                    0.16088835 
##            industryRealEstate                industryRetail 
##                    0.83143438                    0.21318139 
##                 industryState               industryTelecom 
##                   -0.06445685                    0.36245948 
##             industrytransport professionBusinessDevelopment 
##                    0.02260270                   -0.50212648 
##          professionCommercial             professionConsult 
##                   -0.82616142                   -0.50600980 
##            professionEngineer                 professionetc 
##                   -0.84671677                   -0.40826416 
##             professionFinance                  professionHR 
##                   -0.17683605                   -0.21913841 
##                  professionIT                 professionLaw 
##                   -0.03975636                   -0.33546462 
##              professionmanage           professionMarketing 
##                   -1.11650093                   -0.64164375 
##                  professionPR               professionSales 
##                   -0.71016173                   -0.49774936 
##            professionTeaching                  trafficempjs 
##                   -0.54065998                   -0.71817392 
##                trafficfriends                     trafficKA 
##                   -0.02755587                   -0.09380616 
##            trafficrabrecNErab               trafficrecNErab 
##                   -0.37274217                    0.08640852 
##                trafficreferal                  trafficyoujs 
##                   -0.23576450                   -0.49756629 
##                 greywagewhite                        waycar 
##                    0.42084521                    0.14436829 
##                       wayfoot                   selfcontrol 
##                    0.29015023                    0.05623800 
##                       anxiety 
##                    0.04809703 
## 
## Scale= 0.8428331 
## 
## Loglik(model)= -2928.2   Loglik(intercept only)= -3014.3
##  Chisq= 172.11 on 42 degrees of freedom, p= <2e-16 
## n= 1129

The model including independent variables like age, industry, profession, traffic, greywage, self-control, and anxiety produced the best accurate survival model, according to the abovementioned result. As a result, we will incorporate these variables into the new model.

model1<-survreg(Surv(stag, event)~age + industry + profession + traffic + greywage + way + selfcontrol + anxiety,
                data = df,
                dist = "weibull")
summary(model1)
## 
## Call:
## survreg(formula = Surv(stag, event) ~ age + industry + profession + 
##     traffic + greywage + way + selfcontrol + anxiety, data = df, 
##     dist = "weibull")
##                                  Value Std. Error     z       p
## (Intercept)                    4.43132    0.62299  7.11 1.1e-12
## age                           -0.01854    0.00533 -3.48 0.00051
## industryAgriculture           -0.73601    0.45308 -1.62 0.10427
## industryBanks                 -0.42850    0.36440 -1.18 0.23964
## industryBuilding              -0.47008    0.38380 -1.22 0.22065
## industryConsult               -0.30034    0.37685 -0.80 0.42546
## industryetc                   -0.14456    0.37051 -0.39 0.69642
## industryIT                     0.36525    0.37923  0.96 0.33548
## industrymanufacture            0.06281    0.36551  0.17 0.86355
## industryMining                -0.10001    0.43080 -0.23 0.81643
## industryPharma                 0.07751    0.43843  0.18 0.85967
## industryPowerGeneration        0.16089    0.41584  0.39 0.69883
## industryRealEstate             0.83143    0.52932  1.57 0.11624
## industryRetail                 0.21318    0.35890  0.59 0.55252
## industryState                 -0.06446    0.39766 -0.16 0.87124
## industryTelecom                0.36246    0.41775  0.87 0.38559
## industrytransport              0.02260    0.40834  0.06 0.95586
## professionBusinessDevelopment -0.50213    0.42064 -1.19 0.23259
## professionCommercial          -0.82616    0.41869 -1.97 0.04847
## professionConsult             -0.50601    0.42676 -1.19 0.23574
## professionEngineer            -0.84672    0.44316 -1.91 0.05605
## professionetc                 -0.40826    0.40485 -1.01 0.31325
## professionFinance             -0.17684    0.43121 -0.41 0.68174
## professionHR                  -0.21914    0.35604 -0.62 0.53823
## professionIT                  -0.03976    0.39811 -0.10 0.92045
## professionLaw                 -0.33546    0.53828 -0.62 0.53315
## professionmanage              -1.11650    0.41709 -2.68 0.00743
## professionMarketing           -0.64164    0.40099 -1.60 0.10957
## professionPR                  -0.71016    0.53669 -1.32 0.18576
## professionSales               -0.49775    0.38223 -1.30 0.19284
## professionTeaching            -0.54066    0.47873 -1.13 0.25874
## trafficempjs                  -0.71817    0.25848 -2.78 0.00546
## trafficfriends                -0.02756    0.28014 -0.10 0.92164
## trafficKA                     -0.09381    0.29140 -0.32 0.74752
## trafficrabrecNErab            -0.37274    0.25591 -1.46 0.14525
## trafficrecNErab                0.08641    0.31364  0.28 0.78293
## trafficreferal                -0.23576    0.26974 -0.87 0.38209
## trafficyoujs                  -0.49757    0.25539 -1.95 0.05138
## greywagewhite                  0.42085    0.11094  3.79 0.00015
## waycar                         0.14437    0.08448  1.71 0.08748
## wayfoot                        0.29015    0.14447  2.01 0.04460
## selfcontrol                    0.05624    0.01884  2.98 0.00284
## anxiety                        0.04810    0.02176  2.21 0.02706
## Log(scale)                    -0.17099    0.03314 -5.16 2.5e-07
## 
## Scale= 0.843 
## 
## Weibull distribution
## Loglik(model)= -2928.2   Loglik(intercept only)= -3014.3
##  Chisq= 172.11 on 42 degrees of freedom, p= 1.1e-17 
## Number of Newton-Raphson Iterations: 7 
## n= 1129

The hazard ratio compares event rates to determine the proportional risk of the complication. Run the code below to obtain the hazard ratio.

hr=exp(model1$coefficients)
hr
##                   (Intercept)                           age 
##                    84.0420217                     0.9816289 
##           industryAgriculture                 industryBanks 
##                     0.4790202                     0.6514843 
##              industryBuilding               industryConsult 
##                     0.6249516                     0.7405641 
##                   industryetc                    industryIT 
##                     0.8654062                     1.4408674 
##           industrymanufacture                industryMining 
##                     1.0648281                     0.9048301 
##                industryPharma       industryPowerGeneration 
##                     1.0805936                     1.1745538 
##            industryRealEstate                industryRetail 
##                     2.2966106                     1.2376091 
##                 industryState               industryTelecom 
##                     0.9375766                     1.4368590 
##             industrytransport professionBusinessDevelopment 
##                     1.0228601                     0.6052423 
##          professionCommercial             professionConsult 
##                     0.4377263                     0.6028965 
##            professionEngineer                 professionetc 
##                     0.4288205                     0.6648032 
##             professionFinance                  professionHR 
##                     0.8379172                     0.8032105 
##                  professionIT                 professionLaw 
##                     0.9610236                     0.7150058 
##              professionmanage           professionMarketing 
##                     0.3274235                     0.5264264 
##                  professionPR               professionSales 
##                     0.4915647                     0.6078973 
##            professionTeaching                  trafficempjs 
##                     0.5823638                     0.4876419 
##                trafficfriends                     trafficKA 
##                     0.9728203                     0.9104592 
##            trafficrabrecNErab               trafficrecNErab 
##                     0.6888428                     1.0902516 
##                trafficreferal                  trafficyoujs 
##                     0.7899667                     0.6080086 
##                 greywagewhite                        waycar 
##                     1.5232485                     1.1553095 
##                       wayfoot                   selfcontrol 
##                     1.3366283                     1.0578494 
##                       anxiety 
##                     1.0492725