This project is a survival analysis approached towards the employee turnover according to many factors related. Survival analysis is a branch of statistics for analyzing the expected duration of time until a significant event happen. In this project, the event will be employees quitting
library(survival)
library(ranger)
library(ggplot2)
library(dplyr)
library(ggfortify)
library(readxl)
library(MASS)
library(ADGofTest)
library(survminer)
library(car)
This project will be using The Employee Turnover dataset.
df=read.csv("D:/WRITE IT RIGHT 2/employee-turnover-weibull-regression/turnover.csv", header = TRUE)
head(df)
## stag event gender age industry profession traffic coach
## 1 7.030801 1 m 35 Banks HR rabrecNErab no
## 2 22.965092 1 m 33 Banks HR empjs no
## 3 15.934292 1 f 35 PowerGeneration HR rabrecNErab no
## 4 15.934292 1 f 35 PowerGeneration HR rabrecNErab no
## 5 8.410678 1 m 32 Retail Commercial youjs yes
## 6 8.969199 1 f 42 manufacture HR empjs yes
## head_gender greywage way extraversion independ selfcontrol anxiety novator
## 1 f white bus 6.2 4.1 5.7 7.1 8.3
## 2 m white bus 6.2 4.1 5.7 7.1 8.3
## 3 m white bus 6.2 6.2 2.6 4.8 8.3
## 4 m white bus 5.4 7.6 4.9 2.5 6.7
## 5 f white bus 3.0 4.1 8.0 7.1 3.7
## 6 m white bus 6.2 6.2 4.1 5.6 6.7
The data has 16 columns (variables) and 1067 rows. The variables explanation is written down below:
stag = Experience (time)event = Employee turnover: turnover(1) or
survive(0)gender = Employee’s gender: female(f) or male(m)age = Employee’s age (year)industry = Employee’s Industry: Banks, PowerGeneration,
Retail, manufacture, Consult, State, Building, IT, HoReCa, Telecom,
Pharma, Mining, transport, Agriculture, RealEstate, etc.profession = Employee’s profession: HR, Commercial,
Marketing, Sales, BusinessDevelopment, Finance, Teaching, manage, IT,
Law, Consult, Engineer, PR, Accounting, etc.traffic = From what pipelene employee came to the
company: rabrecNErab, empjs, youjs, referal, advert, KA, recNErab, and
friendscoach = Presence of a coach (training) on probation:
yes or nohead_gender = Head (supervisor) gender: female(f) or
male(m)greywage = The salary does not seem to the tax
authoritiesway = How an employee gets to workplace (by feet, by
bus etc)extraversionindependselfcontrolanxietynovatorIn exploratory data analysis, it is crucial to run tests to determine
whether the distribution of the time variable follows a particular
distribution before modeling the data. The Weibull Distribution is the
most common distribution used in survival analysis, while the time
variable follows it. Thus, we will check on stag as the
time variable to see if it has the Weibull distribution.
As a comparator of p-values in this project, we will use the \(\alpha = 0.05\)
fitdistr(df$stag, "weibull")
## shape scale
## 1.08272153 37.78782884
## ( 0.02503311) ( 1.09608991)
According to the result in the previous code, we obtained the Weibull Distribution parameters are:
Therefore, we can continue to test the time variable using Anderson-Darling Goodness of Fit test using required parameters.
ad.test(df$stag, pweibull, shape= 1.08272153, scale=37.78782884)
##
## Anderson-Darling GoF Test
##
## data: df$stag and pweibull
## AD = 1.3391, p-value = 0.22
## alternative hypothesis: NA
We obtained the p-value of 0.22 > 0.05, therefore we can conclude that the time variable is following The Weibull Distribution.
There are gender, industry,
profression, traffic, coach,
head_gender, greywage, and way as
the categorical variables. The Kaplan-Meier Estimation Curve in survival
analysis allows us to evaluate how each category of survival probability
differs from the others. To determine the difference between each
category based on the curve we will receive, we will thus use The Kaplan
Meier Estimation. We will run the Log-Rank test to ensure if there is a
difference between the categories of the variables.
By running the following code, we will first perform the Kaplan-Meier
estimation according to the gender.
km_gender<-survfit(Surv(stag, event)~gender, data = df, type="kaplan-meier")
When the code above is run, the code below can be used to display the Kaplan-Meier estimation curve.
ggsurvplot(km_gender, data=df,
conf.int = FALSE,
ggtheme = theme_minimal(),
legend.labs = c("female", "male"),
pval = TRUE,
pval.method = TRUE)+
ggtitle("Survival curve based on Gender")
The Kaplan-Meier curve shown above shows that female workers have a higher survival probability than male workers.
Since we achieved the p-value of 0.13 > 0.05, we can say that
there is no significant distinction between the survival probabilities
of each gender.
By running the following code, we will first perform the Kaplan-Meier
estimation according to the industry.
km_industry<-survfit(Surv(stag, event)~industry, data = df, type ="kaplan-meier")
When the code above is run, the code below can be used to display the Kaplan-Meier estimation curve.
ggsurvplot(km_industry, data=df,
conf.int = FALSE,
ggtheme = theme_minimal(),
pval = TRUE,
pval.method = TRUE)+
ggtitle("Survival curve based on Industry")
The Kaplan-Meier curve shown above demonstrates that employees in the retail industry have the highest chance of surviving. Workers in the agricultural industry, however, have the lowest chances of surviving.
Since we achieved the p-value < 0.05, we can say that there is
significant distinction between the survival probabilities of each
industry.
By running the following code, we will first perform the Kaplan-Meier
estimation according to the profession.
km_profession<-survfit(Surv(stag, event)~profession, data = df, type="kaplan-meier")
When the code above is run, the code below can be used to display the Kaplan-Meier estimation curve.
ggsurvplot(km_profession, data=df,
conf.int = FALSE,
ggtheme = theme_minimal(),
pval = TRUE,
pval.method = TRUE)+
ggtitle("Survival curve based on Profession")
The Kaplan-Meier curve shown above shows that employees in the HR profession have the highest chance of surviving. Workers in the law profession, on the other hand, have the lowest chances of surviving.
Since we achieved the p-value of 0.0087 < 0.05, we can say that
there is significant distinction between the survival probabilities of
each profession.
By running the following code, we will first perform the Kaplan-Meier
estimation according to the traffic.
km_traffic<-survfit(Surv(stag, event)~traffic, data= df, type="kaplan-meier")
When the code above is run, the code below can be used to display the Kaplan-Meier estimation curve.
ggsurvplot(km_traffic, data=df,
conf.int = FALSE,
ggtheme = theme_minimal(),
pval = TRUE,
pval.method = TRUE)+
ggtitle("Survival curve based on Traffic")
Workers from the ‘traffic’ of KA have the highest probability of
surviving, as shown by the Kaplan-Meier curve above. Workers from the
“traffic” of friends, on the other hand, have the lowest chance of
surviving.
Since we achieved the p-value of 0.001 < 0.05, we can say that
there is significant distinction between the survival probabilities of
each category of traffic.
By running the following code, we will first perform the Kaplan-Meier
estimation according to the coach.
km_coach<-survfit(Surv(stag, event)~coach, data = df, type="kaplan-meier")
When the code above is run, the code below can be used to display the Kaplan-Meier estimation curve.
ggsurvplot(km_coach, data=df,
conf.int = FALSE,
ggtheme = theme_minimal(),
pval = TRUE,
pval.method = TRUE)+
ggtitle("Survival curve based on Coach")
The workers from the ‘coach’ of no have the highest chance of surviving, as shown by the Kaplan-Meier curve above. Workers from the ‘coach’ of yes, on the other hand, have the lowest chance of surviving.
Since we achieved the p-value of 0.19 > 0.05, we can say that
there is no significant distinction between the survival probabilities
of each category of coach.
By running the following code, we will first perform the Kaplan-Meier
estimation according to the head_gender.
km_headgender<-survfit(Surv(stag, event)~head_gender, data= df, type="kaplan-meier")
When the code above is run, the code below can be used to display the Kaplan-Meier estimation curve.
ggsurvplot(km_headgender, data=df,
conf.int = FALSE,
ggtheme = theme_minimal(),
legend.labs = c("female", "male"),
pval = TRUE,
pval.method = TRUE)+
ggtitle("Survival curve based on Head Gender")
We can observe from the Kaplan-Meier estimator that workers with a female head_gender have a higher chance of surviving than those with a male head_gender.
Since we achieved the p-value of 0.27 > 0.05, we can say that
there is no significant distinction between the survival probabilities
of head_gender.
By running the following code, we will first perform the Kaplan-Meier
estimation according to the greywage.
km_greywage<-survfit(Surv(stag, event)~greywage, data = df, type ="kaplan-meier")
When the code above is run, the code below can be used to display the Kaplan-Meier estimation curve.
ggsurvplot(km_greywage, data=df,
conf.int = FALSE,
ggtheme = theme_minimal(),
legend.labs = c("grey", "white"),
pval = TRUE,
pval.method = TRUE)+
ggtitle("Survival curve based on Greywage")
The Kaplan-Meier estimate shown above shows that workers with the ‘greywage’ of white have a higher likelihood of surviving than those with the ‘greywage’ of grey.
Since we achieved the p-value of 0.0001 < 0.05, we can say that
there is significant distinction between the survival probabilities of
greywage.
By running the following code, we will first perform the Kaplan-Meier
estimation according to the way.
km_way<-survfit(Surv(stag, event)~way,
data= df,
type="kaplan-meier")
When the code above is run, the code below can be used to display the Kaplan-Meier estimation curve.
ggsurvplot(km_way, data=df,
conf.int = FALSE,
ggtheme = theme_minimal(),
legend.labs=c("bus", "car", "foot"),
pval = TRUE,
pval.method = TRUE)+
ggtitle("Survival curve based on Commuters(way)")
Workers who take the bus to work have the best chance of surviving, as
seen by the Kaplan-Meier curve above. On the other hand, those who
commute to work by car have the lowest chance of surviving.
Since we achieved the p-value of 0.003 < 0.05, we can say that
there is significant distinction between the survival probabilities of
way.
We will first use the Weibull regression model to create the main model for the data, which will incorporate all independent variables.
model0<-survreg(Surv(stag, event)~.,
data = df, dist = "weibull")
We can focus the model using the independent variable that has the greatest influence from the main model. It’s possible that the primary model we find didn’t accurately capture the factors that had the greatest impact on the employees’ survival. As a result, we will execute the code below, which will produce a list of all possible combinations of independent variables together with their AIC values. The best appropriate model that we can use will be indicated by the AIC number itself.
step(model0)
## Start: AIC=5954.11
## Surv(stag, event) ~ gender + age + industry + profession + traffic +
## coach + head_gender + greywage + way + extraversion + independ +
## selfcontrol + anxiety + novator
##
## Df AIC
## - novator 1 5952.2
## - extraversion 1 5952.2
## - coach 2 5952.3
## - independ 1 5952.4
## - head_gender 1 5952.6
## - gender 1 5952.6
## <none> 5954.1
## - selfcontrol 1 5954.6
## - anxiety 1 5955.1
## - profession 14 5956.2
## - way 2 5956.5
## - age 1 5962.0
## - greywage 1 5964.3
## - traffic 7 5978.6
## - industry 15 5980.3
##
## Step: AIC=5952.15
## Surv(stag, event) ~ gender + age + industry + profession + traffic +
## coach + head_gender + greywage + way + extraversion + independ +
## selfcontrol + anxiety
##
## Df AIC
## - extraversion 1 5950.3
## - coach 2 5950.4
## - independ 1 5950.5
## - gender 1 5950.6
## - head_gender 1 5950.6
## <none> 5952.2
## - anxiety 1 5953.1
## - selfcontrol 1 5953.7
## - profession 14 5954.2
## - way 2 5954.5
## - age 1 5960.1
## - greywage 1 5962.3
## - traffic 7 5976.6
## - industry 15 5978.7
##
## Step: AIC=5950.25
## Surv(stag, event) ~ gender + age + industry + profession + traffic +
## coach + head_gender + greywage + way + independ + selfcontrol +
## anxiety
##
## Df AIC
## - coach 2 5948.6
## - gender 1 5948.7
## - head_gender 1 5948.7
## - independ 1 5949.0
## <none> 5950.3
## - profession 14 5952.6
## - anxiety 1 5952.6
## - way 2 5952.6
## - selfcontrol 1 5956.1
## - age 1 5958.1
## - greywage 1 5960.3
## - traffic 7 5975.1
## - industry 15 5977.1
##
## Step: AIC=5948.55
## Surv(stag, event) ~ gender + age + industry + profession + traffic +
## head_gender + greywage + way + independ + selfcontrol + anxiety
##
## Df AIC
## - gender 1 5947.0
## - head_gender 1 5947.0
## - independ 1 5947.4
## <none> 5948.6
## - way 2 5950.4
## - anxiety 1 5951.1
## - profession 14 5951.3
## - selfcontrol 1 5955.4
## - age 1 5956.8
## - greywage 1 5958.8
## - traffic 7 5973.2
## - industry 15 5977.0
##
## Step: AIC=5946.99
## Surv(stag, event) ~ age + industry + profession + traffic + head_gender +
## greywage + way + independ + selfcontrol + anxiety
##
## Df AIC
## - head_gender 1 5945.4
## - independ 1 5945.9
## <none> 5947.0
## - way 2 5949.0
## - profession 14 5949.4
## - anxiety 1 5950.6
## - selfcontrol 1 5954.6
## - age 1 5955.3
## - greywage 1 5957.6
## - traffic 7 5971.7
## - industry 15 5975.7
##
## Step: AIC=5945.41
## Surv(stag, event) ~ age + industry + profession + traffic + greywage +
## way + independ + selfcontrol + anxiety
##
## Df AIC
## - independ 1 5944.4
## <none> 5945.4
## - way 2 5947.4
## - profession 14 5948.4
## - anxiety 1 5949.3
## - selfcontrol 1 5953.3
## - age 1 5955.7
## - greywage 1 5956.0
## - traffic 7 5969.7
## - industry 15 5974.3
##
## Step: AIC=5944.43
## Surv(stag, event) ~ age + industry + profession + traffic + greywage +
## way + selfcontrol + anxiety
##
## Df AIC
## <none> 5944.4
## - way 2 5946.5
## - profession 14 5947.0
## - anxiety 1 5947.3
## - selfcontrol 1 5951.3
## - age 1 5954.0
## - greywage 1 5955.4
## - traffic 7 5968.9
## - industry 15 5973.3
## Call:
## survreg(formula = Surv(stag, event) ~ age + industry + profession +
## traffic + greywage + way + selfcontrol + anxiety, data = df,
## dist = "weibull")
##
## Coefficients:
## (Intercept) age
## 4.43131693 -0.01854194
## industryAgriculture industryBanks
## -0.73601255 -0.42850204
## industryBuilding industryConsult
## -0.47008103 -0.30034308
## industryetc industryIT
## -0.14455627 0.36524531
## industrymanufacture industryMining
## 0.06281340 -0.10000811
## industryPharma industryPowerGeneration
## 0.07751048 0.16088835
## industryRealEstate industryRetail
## 0.83143438 0.21318139
## industryState industryTelecom
## -0.06445685 0.36245948
## industrytransport professionBusinessDevelopment
## 0.02260270 -0.50212648
## professionCommercial professionConsult
## -0.82616142 -0.50600980
## professionEngineer professionetc
## -0.84671677 -0.40826416
## professionFinance professionHR
## -0.17683605 -0.21913841
## professionIT professionLaw
## -0.03975636 -0.33546462
## professionmanage professionMarketing
## -1.11650093 -0.64164375
## professionPR professionSales
## -0.71016173 -0.49774936
## professionTeaching trafficempjs
## -0.54065998 -0.71817392
## trafficfriends trafficKA
## -0.02755587 -0.09380616
## trafficrabrecNErab trafficrecNErab
## -0.37274217 0.08640852
## trafficreferal trafficyoujs
## -0.23576450 -0.49756629
## greywagewhite waycar
## 0.42084521 0.14436829
## wayfoot selfcontrol
## 0.29015023 0.05623800
## anxiety
## 0.04809703
##
## Scale= 0.8428331
##
## Loglik(model)= -2928.2 Loglik(intercept only)= -3014.3
## Chisq= 172.11 on 42 degrees of freedom, p= <2e-16
## n= 1129
The model including independent variables like age,
industry, profession, traffic,
greywage, self-control, and
anxiety produced the best accurate survival model,
according to the abovementioned result. As a result, we will incorporate
these variables into the new model.
model1<-survreg(Surv(stag, event)~age + industry + profession + traffic + greywage + way + selfcontrol + anxiety,
data = df,
dist = "weibull")
summary(model1)
##
## Call:
## survreg(formula = Surv(stag, event) ~ age + industry + profession +
## traffic + greywage + way + selfcontrol + anxiety, data = df,
## dist = "weibull")
## Value Std. Error z p
## (Intercept) 4.43132 0.62299 7.11 1.1e-12
## age -0.01854 0.00533 -3.48 0.00051
## industryAgriculture -0.73601 0.45308 -1.62 0.10427
## industryBanks -0.42850 0.36440 -1.18 0.23964
## industryBuilding -0.47008 0.38380 -1.22 0.22065
## industryConsult -0.30034 0.37685 -0.80 0.42546
## industryetc -0.14456 0.37051 -0.39 0.69642
## industryIT 0.36525 0.37923 0.96 0.33548
## industrymanufacture 0.06281 0.36551 0.17 0.86355
## industryMining -0.10001 0.43080 -0.23 0.81643
## industryPharma 0.07751 0.43843 0.18 0.85967
## industryPowerGeneration 0.16089 0.41584 0.39 0.69883
## industryRealEstate 0.83143 0.52932 1.57 0.11624
## industryRetail 0.21318 0.35890 0.59 0.55252
## industryState -0.06446 0.39766 -0.16 0.87124
## industryTelecom 0.36246 0.41775 0.87 0.38559
## industrytransport 0.02260 0.40834 0.06 0.95586
## professionBusinessDevelopment -0.50213 0.42064 -1.19 0.23259
## professionCommercial -0.82616 0.41869 -1.97 0.04847
## professionConsult -0.50601 0.42676 -1.19 0.23574
## professionEngineer -0.84672 0.44316 -1.91 0.05605
## professionetc -0.40826 0.40485 -1.01 0.31325
## professionFinance -0.17684 0.43121 -0.41 0.68174
## professionHR -0.21914 0.35604 -0.62 0.53823
## professionIT -0.03976 0.39811 -0.10 0.92045
## professionLaw -0.33546 0.53828 -0.62 0.53315
## professionmanage -1.11650 0.41709 -2.68 0.00743
## professionMarketing -0.64164 0.40099 -1.60 0.10957
## professionPR -0.71016 0.53669 -1.32 0.18576
## professionSales -0.49775 0.38223 -1.30 0.19284
## professionTeaching -0.54066 0.47873 -1.13 0.25874
## trafficempjs -0.71817 0.25848 -2.78 0.00546
## trafficfriends -0.02756 0.28014 -0.10 0.92164
## trafficKA -0.09381 0.29140 -0.32 0.74752
## trafficrabrecNErab -0.37274 0.25591 -1.46 0.14525
## trafficrecNErab 0.08641 0.31364 0.28 0.78293
## trafficreferal -0.23576 0.26974 -0.87 0.38209
## trafficyoujs -0.49757 0.25539 -1.95 0.05138
## greywagewhite 0.42085 0.11094 3.79 0.00015
## waycar 0.14437 0.08448 1.71 0.08748
## wayfoot 0.29015 0.14447 2.01 0.04460
## selfcontrol 0.05624 0.01884 2.98 0.00284
## anxiety 0.04810 0.02176 2.21 0.02706
## Log(scale) -0.17099 0.03314 -5.16 2.5e-07
##
## Scale= 0.843
##
## Weibull distribution
## Loglik(model)= -2928.2 Loglik(intercept only)= -3014.3
## Chisq= 172.11 on 42 degrees of freedom, p= 1.1e-17
## Number of Newton-Raphson Iterations: 7
## n= 1129
The hazard ratio compares event rates to determine the proportional risk of the complication. Run the code below to obtain the hazard ratio.
hr=exp(model1$coefficients)
hr
## (Intercept) age
## 84.0420217 0.9816289
## industryAgriculture industryBanks
## 0.4790202 0.6514843
## industryBuilding industryConsult
## 0.6249516 0.7405641
## industryetc industryIT
## 0.8654062 1.4408674
## industrymanufacture industryMining
## 1.0648281 0.9048301
## industryPharma industryPowerGeneration
## 1.0805936 1.1745538
## industryRealEstate industryRetail
## 2.2966106 1.2376091
## industryState industryTelecom
## 0.9375766 1.4368590
## industrytransport professionBusinessDevelopment
## 1.0228601 0.6052423
## professionCommercial professionConsult
## 0.4377263 0.6028965
## professionEngineer professionetc
## 0.4288205 0.6648032
## professionFinance professionHR
## 0.8379172 0.8032105
## professionIT professionLaw
## 0.9610236 0.7150058
## professionmanage professionMarketing
## 0.3274235 0.5264264
## professionPR professionSales
## 0.4915647 0.6078973
## professionTeaching trafficempjs
## 0.5823638 0.4876419
## trafficfriends trafficKA
## 0.9728203 0.9104592
## trafficrabrecNErab trafficrecNErab
## 0.6888428 1.0902516
## trafficreferal trafficyoujs
## 0.7899667 0.6080086
## greywagewhite waycar
## 1.5232485 1.1553095
## wayfoot selfcontrol
## 1.3366283 1.0578494
## anxiety
## 1.0492725