In today’s fast-paced work environments, retaining employees has become a top priority for organizations worldwide. When a team member leaves the company, this not only causes a disruption in workflows, but also leads to an increase in terms of costs that need to be spent on recruitment, training as well as necessary time to get along with other team members that decrease the overall productivity. As there’s a rising demand on identifying and tackling this challenge, many companies are increasingly turning to sophisticated analytical methods to uncover the underlying reasons behind employee turnover and allows for the development of predictive models to anticipate future departures. In this project, I will focus on using Survival Analysis for this problem with its advantages in modeling the employee’s risk of quitting from the companies based on a real dataset provided by Edward Babushkin, who examines the importance of this method instead of using logistic regression for risk assessment.
The dataset contains the following variables:
f: Femalem: Maleadvert: Employee contacted the company directly after
learning about it (through ads, company branding, etc.)recNErab: Contacted the company on the recommendation
of a non-employee friendreferal: Contacted the company on the recommendation of
an employee friendyoujs: Applied for a vacancy on a job siteKA: Recruited through a recruitment agencyfriends: Invited by the employer, with prior
acquaintancerabrecNErab: Employer contacted the employee on the
recommendation of a non-employee acquaintanceempjs: Employer found the employee through their resume
on a job site# Read the data
emp_df <- read.csv("employee/turnover.csv")
# Show the data's structure
str(emp_df)
## 'data.frame': 1129 obs. of 16 variables:
## $ stag : num 7.03 22.97 15.93 15.93 8.41 ...
## $ event : int 1 1 1 1 1 1 1 1 1 1 ...
## $ gender : chr "m" "m" "f" "f" ...
## $ age : num 35 33 35 35 32 42 42 28 29 30 ...
## $ industry : chr "Banks" "Banks" "PowerGeneration" "PowerGeneration" ...
## $ profession : chr "HR" "HR" "HR" "HR" ...
## $ traffic : chr "rabrecNErab" "empjs" "rabrecNErab" "rabrecNErab" ...
## $ coach : chr "no" "no" "no" "no" ...
## $ head_gender : chr "f" "m" "m" "m" ...
## $ greywage : chr "white" "white" "white" "white" ...
## $ way : chr "bus" "bus" "bus" "bus" ...
## $ extraversion: num 6.2 6.2 6.2 5.4 3 6.2 6.2 3.8 8.6 5.4 ...
## $ independ : num 4.1 4.1 6.2 7.6 4.1 6.2 6.2 5.5 6.9 5.5 ...
## $ selfcontrol : num 5.7 5.7 2.6 4.9 8 4.1 4.1 8 2.6 3.3 ...
## $ anxiety : num 7.1 7.1 4.8 2.5 7.1 5.6 5.6 4 4 7.9 ...
## $ novator : num 8.3 8.3 8.3 6.7 3.7 6.7 6.7 4.4 7.5 8.3 ...
# Sample the first 6 rows of the data
head(emp_df)
## stag event gender age industry profession traffic coach
## 1 7.030801 1 m 35 Banks HR rabrecNErab no
## 2 22.965092 1 m 33 Banks HR empjs no
## 3 15.934292 1 f 35 PowerGeneration HR rabrecNErab no
## 4 15.934292 1 f 35 PowerGeneration HR rabrecNErab no
## 5 8.410678 1 m 32 Retail Commercial youjs yes
## 6 8.969199 1 f 42 manufacture HR empjs yes
## head_gender greywage way extraversion independ selfcontrol anxiety novator
## 1 f white bus 6.2 4.1 5.7 7.1 8.3
## 2 m white bus 6.2 4.1 5.7 7.1 8.3
## 3 m white bus 6.2 6.2 2.6 4.8 8.3
## 4 m white bus 5.4 7.6 4.9 2.5 6.7
## 5 f white bus 3.0 4.1 8.0 7.1 3.7
## 6 m white bus 6.2 6.2 4.1 5.6 6.7
# Rename the columns for better representation
colnames(emp_df) <- c("exp","event","self_gender","age","field","prof","route","coach","head_gender","wage","trans","score_extro","score_idpt","score_selfctrl","score_anx","score_innov")
# Categorize feature based on their data types
cat_features <- c("self_gender", "field", "prof", "route", "coach", "head_gender", "wage", "trans")
num_features <- colnames(emp_df)[!c(colnames(emp_df) %in% cat_features)]
num_features
## [1] "exp" "event" "age" "score_extro"
## [5] "score_idpt" "score_selfctrl" "score_anx" "score_innov"
First, we can investigate the unique values of each categorical variable.
# Unique values of each categorical features
for (col in cat_features){
print(table(emp_df[col]))
}
## self_gender
## f m
## 853 276
## field
## HoReCa Agriculture Banks Building Consult
## 11 15 114 41 74
## etc IT manufacture Mining Pharma
## 94 122 145 24 20
## PowerGeneration RealEstate Retail State Telecom
## 38 13 289 55 36
## transport
## 38
## prof
## Finan\xf1e Accounting BusinessDevelopment Commercial
## 17 10 27 23
## Consult Engineer etc HR
## 25 15 37 757
## IT Law manage Marketing
## 74 7 22 31
## PR Sales Teaching
## 6 66 12
## route
## advert empjs friends KA rabrecNErab recNErab
## 33 248 118 67 211 39
## referal youjs
## 95 318
## coach
## my head no yes
## 314 683 132
## head_gender
## f m
## 545 584
## wage
## grey white
## 127 1002
## trans
## bus car foot
## 681 331 117
As we can see from a sneek peak of the dataset, during the data collection process, some typographical errors were introduced (e.g., Finance is recorded as Finan1e), which could potentially complicate the analysis and interpretation. To ensure accurate results and clear insights, it is necessary to address these inconsistencies. Additionally, we can refine and re-categorize certain values for easier interpretation based on the provided assumptions. Below are some suggested re-factorings for the categorical variables:
emp_df <- emp_df %>%
mutate(prof = if_else(prof == "Finan\xf1e", "Finance", prof))
emp_df %>% mutate(field = factor(case_match(
field,
.default="others",
"Retail" ~ "retail",
"manufacture" ~ "manufacture",
"IT" ~ "IT",
"Banks" ~ "finance"))) -> emp_df
emp_df %>% mutate(route = factor(case_match(
route,
.default="others",
"empjs" ~ "jobsite_resume",
"friends" ~ "friends",
c("referal", "rabrecNErab", "recNErab") ~ "recommend",
"youjs" ~ "jobsite_apply"))) -> emp_df
for (col in cat_features){
print(table(emp_df[col]))
}
## self_gender
## f m
## 853 276
## field
## finance IT manufacture others retail
## 114 122 145 459 289
## prof
## Accounting BusinessDevelopment Commercial Consult
## 10 27 23 25
## Engineer etc Finance HR
## 15 37 17 757
## IT Law manage Marketing
## 74 7 22 31
## PR Sales Teaching
## 6 66 12
## route
## friends jobsite_apply jobsite_resume others recommend
## 118 318 248 100 345
## coach
## my head no yes
## 314 683 132
## head_gender
## f m
## 545 584
## wage
## grey white
## 127 1002
## trans
## bus car foot
## 681 331 117
# Convert all categorical features to factor data type
for (col in cat_features) {
emp_df[[col]] <- as.factor(emp_df[[col]])
}
## Change appropriate reference group:
emp_df$field <- relevel(emp_df$field, ref = "others")
emp_df$route <- relevel(emp_df$route, ref = "others")
emp_df$coach <- relevel(emp_df$coach, ref = "no")
emp_df$wage <- relevel(emp_df$wage, ref = "grey")
emp_df$trans <- relevel(emp_df$trans, ref = "bus")
In this section, we will explore the dataset to gain initial insights and identify patterns, trends, and potential anomalies. Through descriptive statistics and visualizations, we will highlight key features and prepare the dataset for more advanced analysis.
# Summary statistics for numerical features
summary(emp_df[num_features])
## exp event age score_extro
## Min. : 0.3942 Min. :0.0000 Min. :18.00 Min. : 1.000
## 1st Qu.: 11.7289 1st Qu.:0.0000 1st Qu.:26.00 1st Qu.: 4.600
## Median : 24.3450 Median :1.0000 Median :30.00 Median : 5.400
## Mean : 36.6275 Mean :0.5058 Mean :31.07 Mean : 5.592
## 3rd Qu.: 51.3183 3rd Qu.:1.0000 3rd Qu.:36.00 3rd Qu.: 7.000
## Max. :179.4497 Max. :1.0000 Max. :58.00 Max. :10.000
## score_idpt score_selfctrl score_anx score_innov
## Min. : 1.000 Min. : 1.000 Min. : 1.700 Min. : 1.00
## 1st Qu.: 4.100 1st Qu.: 4.100 1st Qu.: 4.800 1st Qu.: 4.40
## Median : 5.500 Median : 5.700 Median : 5.600 Median : 6.00
## Mean : 5.478 Mean : 5.597 Mean : 5.666 Mean : 5.88
## 3rd Qu.: 6.900 3rd Qu.: 7.200 3rd Qu.: 7.100 3rd Qu.: 7.50
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.00
# Set up the plot window to display multiple plots
par(mfrow = c(2, 2))
for (column in num_features) {
hist(emp_df[[column]], main = paste("Histogram of", column), xlab = column, col = "lightblue")
}
We will further look into the distribution of response variable
Experience (exp) for non-censored as well censored
data:
par(mfrow=c(2,1))
hist(as.matrix(emp_df %>% filter(event==1) %>% pull(exp)), xlab="Experience", main="Non-censored data",col = "lightblue")
hist(as.matrix(emp_df %>% filter(event==0) %>% pull(exp)), xlab="Experience", main="Censored data", col = "lightblue")
As we can observe the non-normal skewness from above histogram of the exp (years of experience) for both the uncensored and censored data, it would be reasonable to apply survival analysis for this problem
long_df <- pivot_longer(emp_df, cols =c(self_gender, field, prof, route, coach, head_gender, wage, trans), names_to = "Category", values_to = "Value")
cat_factors <- c("self_gender", "field", "prof", "route", "coach", "head_gender", "wage", "trans")
# Loop through each categorical variable and create a plot
plots <- lapply(cat_factors, function(cat) {
ggplot(data = filter(long_df, Category == cat), aes(x = Value, fill = Value)) +
geom_bar() +
facet_wrap(~event, nrow = 2) +
labs(title = paste("Distribution of", cat, "by Event (censored or not)"), x = cat, y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
})
plots
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
First, we will investigate whether different groups within the dataset exhibit variations in retention rates (also referred to as survival rates in the context of survival analysis). To do this, we will calculate Kaplan-Meier estimates for each group, enabling a comparison of their retention rates. Additionally, we will conduct a log-rank test to determine whether the observed differences between the groups are statistically significant with predetermined significance level of 0.05.
mod <- survfit(Surv(exp, event)~self_gender, data = emp_df)
plot1 <- ggsurvplot(mod, data = emp_df,
conf.int = FALSE,
ggtheme = theme_minimal(),
pval = TRUE,
pval.method = TRUE) +
ggtitle("Retention curve by Employee's Gender")
mod <- survfit(Surv(exp, event)~field, data = emp_df)
plot2 <- ggsurvplot(mod, data = emp_df,
conf.int = FALSE,
ggtheme = theme_minimal(),
pval = TRUE,
pval.method = TRUE) +
ggtitle("Retention curve by Employee's Industry sector")
mod <- survfit(Surv(exp, event)~prof, data = emp_df)
plot3 <- ggsurvplot(mod, data = emp_df,
conf.int = FALSE,
ggtheme = theme_minimal(),
pval = TRUE,
pval.method = TRUE) +
ggtitle("Retention curve by Employee's Profession")
mod <- survfit(Surv(exp, event)~route, data = emp_df)
plot4 <- ggsurvplot(mod, data = emp_df,
conf.int = FALSE,
ggtheme = theme_minimal(),
pval = TRUE,
pval.method = TRUE) +
ggtitle("Retention curve by Employee's Source came to the Company")
mod <- survfit(Surv(exp, event)~coach, data = emp_df)
plot5 <- ggsurvplot(mod, data = emp_df,
conf.int = FALSE,
ggtheme = theme_minimal(),
pval = TRUE,
pval.method = TRUE) +
ggtitle("Retention curve by Employee's Coaching/Training availability")
mod <- survfit(Surv(exp, event)~head_gender, data = emp_df)
plot6 <- ggsurvplot(mod, data = emp_df,
conf.int = FALSE,
ggtheme = theme_minimal(),
pval = TRUE,
pval.method = TRUE) +
ggtitle("Retention curve by Gender of the employee's supervisor")
mod <- survfit(Surv(exp, event)~wage, data = emp_df)
plot7 <- ggsurvplot(mod, data = emp_df,
conf.int = FALSE,
ggtheme = theme_minimal(),
pval = TRUE,
pval.method = TRUE) +
ggtitle("Retention curve by Employee's Wage (Tax reported or not)")
mod <- survfit(Surv(exp, event)~trans, data = emp_df)
plot8 <- ggsurvplot(mod, data = emp_df,
conf.int = FALSE,
ggtheme = theme_minimal(),
pval = TRUE,
pval.method = TRUE) +
ggtitle("Retention curve by Employee's mode of Transportation")
plot1$plot; plot2$plot; plot3$plot; plot4$plot;
plot5$plot; plot6$plot; plot7$plot; plot8$plot
Based on the following results, we can confirm that there’s difference in terms of retention rate for the following variable: - Industry Sector (field) - Profession (prof) - Source came to Company (route) - Wage tax authorized or not (wage) - Mode of Transportation (trans)
We need to determine which distribution the Experience (exp) follows before applying parametric method such as Accelerated Failure Time (AFT) model.We will examine two popular assumed form of Experience (exp) Weibull, Exponential distribution.
library(goftest)
fitdistr(emp_df$exp, "weibull")
## shape scale
## 1.08272153 37.78782884
## ( 0.02503311) ( 1.09608991)
weibullCDF <- function(x, shape, scale) {
1 - exp(- (x / scale)^shape)
}
# Estimate parameters if not known
shape_hat <- 1.08272153
scale_hat <- 37.78782884
# Perform the Anderson-Darling test for Weibull
result <- ad.test(emp_df$exp, function(x) weibullCDF(x, shape = shape_hat, scale = scale_hat))
# Print the test result
print(result)
##
## Anderson-Darling test of goodness-of-fit
## Null hypothesis: distribution 'function(x) weibullCDF(x, shape =
## shape_hat, scale = scale_hat)'
## Parameters assumed to be fixed
##
## data: emp_df$exp
## An = 1.3391, p-value = 0.22
# Fit exponential model
exp_model <- survreg(Surv(exp, event) ~ 1, data = emp_df, dist = 'exponential')
lambda_hat <- 1 / exp_model$scale # Rate parameter
# Anderson-Darling test for exponentiality
result <- ad.test(emp_df$exp, "pexp", rate = lambda_hat)
# Print the result
print(result)
##
## Anderson-Darling test of goodness-of-fit
## Null hypothesis: exponential distribution
## with parameter rate = 1
## Parameters assumed to be fixed
##
## data: emp_df$exp
## An = Inf, p-value = 5.314e-07
After fitting the exponential model with a scale parameter of 1, we obtain a p-value of less than 0.05, leading us to reject the null hypothesis that the exponential distribution with a rate parameter of 1 is appropriate for this model. Next, we test the Weibull distribution, using the estimated scale and shape parameters derived from the exp variable in the dataset. The p-value from the Anderson-Darling Goodness-of-Fit test is 0.22, indicating that we fail to reject the null hypothesis. Thus, we proceed with the Weibull distribution as the assumption for the AFT model as below.
We will define the base model that included all predictors, then perform step-wise regression for finding the best combinations of predictors (based on the criteria of lowest AIC)
base_mod<-survreg(Surv(exp, event)~.,data = emp_df, dist = "weibull")
step(base_mod)
## Start: AIC=5956.48
## Surv(exp, event) ~ self_gender + age + field + prof + route +
## coach + head_gender + wage + trans + score_extro + score_idpt +
## score_selfctrl + score_anx + score_innov
##
## Df AIC
## - score_innov 1 5954.5
## - self_gender 1 5954.7
## - score_idpt 1 5954.7
## - score_extro 1 5955.0
## - head_gender 1 5955.4
## - coach 2 5955.9
## <none> 5956.5
## - score_selfctrl 1 5957.0
## - prof 14 5957.7
## - score_anx 1 5957.8
## - trans 2 5961.3
## - age 1 5965.0
## - wage 1 5966.6
## - field 4 5979.1
## - route 4 5981.4
##
## Step: AIC=5954.5
## Surv(exp, event) ~ self_gender + age + field + prof + route +
## coach + head_gender + wage + trans + score_extro + score_idpt +
## score_selfctrl + score_anx
##
## Df AIC
## - self_gender 1 5952.7
## - score_idpt 1 5952.8
## - score_extro 1 5953.0
## - head_gender 1 5953.4
## - coach 2 5953.9
## <none> 5954.5
## - score_selfctrl 1 5955.4
## - prof 14 5955.7
## - score_anx 1 5956.0
## - trans 2 5959.3
## - age 1 5963.0
## - wage 1 5964.6
## - field 4 5977.3
## - route 4 5979.5
##
## Step: AIC=5952.68
## Surv(exp, event) ~ age + field + prof + route + coach + head_gender +
## wage + trans + score_extro + score_idpt + score_selfctrl +
## score_anx
##
## Df AIC
## - score_idpt 1 5951.0
## - score_extro 1 5951.1
## - head_gender 1 5951.5
## - coach 2 5952.0
## <none> 5952.7
## - score_selfctrl 1 5953.8
## - prof 14 5954.2
## - score_anx 1 5954.8
## - trans 2 5957.6
## - age 1 5961.2
## - wage 1 5963.0
## - field 4 5976.1
## - route 4 5977.6
##
## Step: AIC=5950.97
## Surv(exp, event) ~ age + field + prof + route + coach + head_gender +
## wage + trans + score_extro + score_selfctrl + score_anx
##
## Df AIC
## - head_gender 1 5949.8
## - score_extro 1 5950.1
## - coach 2 5950.4
## <none> 5951.0
## - score_selfctrl 1 5952.1
## - prof 14 5952.2
## - score_anx 1 5953.4
## - trans 2 5955.9
## - age 1 5959.3
## - wage 1 5961.5
## - field 4 5974.3
## - route 4 5975.9
##
## Step: AIC=5949.83
## Surv(exp, event) ~ age + field + prof + route + coach + wage +
## trans + score_extro + score_selfctrl + score_anx
##
## Df AIC
## - score_extro 1 5949.0
## - coach 2 5949.4
## <none> 5949.8
## - score_selfctrl 1 5951.0
## - prof 14 5951.9
## - score_anx 1 5952.7
## - trans 2 5955.0
## - age 1 5959.8
## - wage 1 5960.4
## - field 4 5972.7
## - route 4 5974.0
##
## Step: AIC=5949.01
## Surv(exp, event) ~ age + field + prof + route + coach + wage +
## trans + score_selfctrl + score_anx
##
## Df AIC
## - coach 2 5948.8
## <none> 5949.0
## - prof 14 5951.1
## - score_anx 1 5953.3
## - trans 2 5954.4
## - score_selfctrl 1 5955.3
## - age 1 5958.1
## - wage 1 5959.4
## - field 4 5972.0
## - route 4 5974.1
##
## Step: AIC=5948.76
## Surv(exp, event) ~ age + field + prof + route + wage + trans +
## score_selfctrl + score_anx
##
## Df AIC
## <none> 5948.8
## - prof 14 5951.4
## - score_anx 1 5952.8
## - trans 2 5953.6
## - score_selfctrl 1 5955.6
## - age 1 5959.3
## - wage 1 5959.5
## - field 4 5971.9
## - route 4 5972.6
## Call:
## survreg(formula = Surv(exp, event) ~ age + field + prof + route +
## wage + trans + score_selfctrl + score_anx, data = emp_df,
## dist = "weibull")
##
## Coefficients:
## (Intercept) age fieldfinance
## 4.290954476 -0.019667910 -0.331611466
## fieldIT fieldmanufacture fieldretail
## 0.460216178 0.133770055 0.294066748
## profBusinessDevelopment profCommercial profConsult
## -0.331475026 -0.816549442 -0.746764123
## profEngineer profetc profFinance
## -0.771198398 -0.477817033 -0.259299256
## profHR profIT profLaw
## -0.256849028 -0.025093874 -0.447873317
## profmanage profMarketing profPR
## -1.096702978 -0.734964234 -0.755396244
## profSales profTeaching routefriends
## -0.503692717 -0.460266310 0.005788691
## routejobsite_apply routejobsite_resume routerecommend
## -0.403216519 -0.652138777 -0.191406994
## wagewhite transcar transfoot
## 0.424160983 0.164941113 0.362512604
## score_selfctrl score_anx
## 0.056083991 0.053587423
##
## Scale= 0.8594605
##
## Loglik(model)= -2944.4 Loglik(intercept only)= -3014.3
## Chisq= 139.79 on 28 degrees of freedom, p= <2e-16
## n= 1129
best_mod <- survreg(formula = Surv(exp, event) ~ age + field + prof + route +
wage + trans + score_selfctrl + score_anx, data = emp_df,
dist = "weibull")
summary(best_mod)
##
## Call:
## survreg(formula = Surv(exp, event) ~ age + field + prof + route +
## wage + trans + score_selfctrl + score_anx, data = emp_df,
## dist = "weibull")
## Value Std. Error z p
## (Intercept) 4.29095 0.46798 9.17 < 2e-16
## age -0.01967 0.00544 -3.61 0.00030
## fieldfinance -0.33161 0.11902 -2.79 0.00533
## fieldIT 0.46022 0.16580 2.78 0.00551
## fieldmanufacture 0.13377 0.12396 1.08 0.28054
## fieldretail 0.29407 0.09774 3.01 0.00262
## profBusinessDevelopment -0.33148 0.42033 -0.79 0.43034
## profCommercial -0.81655 0.42315 -1.93 0.05365
## profConsult -0.74676 0.42241 -1.77 0.07708
## profEngineer -0.77120 0.44142 -1.75 0.08062
## profetc -0.47782 0.40574 -1.18 0.23894
## profFinance -0.25930 0.43531 -0.60 0.55140
## profHR -0.25685 0.35998 -0.71 0.47553
## profIT -0.02509 0.40052 -0.06 0.95004
## profLaw -0.44787 0.52706 -0.85 0.39546
## profmanage -1.09670 0.41998 -2.61 0.00902
## profMarketing -0.73496 0.40380 -1.82 0.06874
## profPR -0.75540 0.53079 -1.42 0.15470
## profSales -0.50369 0.38679 -1.30 0.19284
## profTeaching -0.46027 0.44598 -1.03 0.30206
## routefriends 0.00579 0.18479 0.03 0.97501
## routejobsite_apply -0.40322 0.14682 -2.75 0.00603
## routejobsite_resume -0.65214 0.14948 -4.36 1.3e-05
## routerecommend -0.19141 0.13977 -1.37 0.17086
## wagewhite 0.42416 0.11255 3.77 0.00016
## transcar 0.16494 0.08429 1.96 0.05037
## transfoot 0.36251 0.14416 2.51 0.01192
## score_selfctrl 0.05608 0.01881 2.98 0.00287
## score_anx 0.05359 0.02189 2.45 0.01436
## Log(scale) -0.15145 0.03298 -4.59 4.4e-06
##
## Scale= 0.859
##
## Weibull distribution
## Loglik(model)= -2944.4 Loglik(intercept only)= -3014.3
## Chisq= 139.79 on 28 degrees of freedom, p= 8.2e-17
## Number of Newton-Raphson Iterations: 7
## n= 1129
We can see some interesting insights based on the significant variables, the following would show all hazard ratios for each estimate obtained from the model and some interpretations of each significant factor based on the results, controlling other factors remain the same.
haz_rate <- exp(best_mod$coefficients)
haz_rate
## (Intercept) age fieldfinance
## 73.0361465 0.9805242 0.7177661
## fieldIT fieldmanufacture fieldretail
## 1.5844165 1.1431299 1.3418735
## profBusinessDevelopment profCommercial profConsult
## 0.7178641 0.4419540 0.4738975
## profEngineer profetc profFinance
## 0.4624585 0.6201357 0.7715921
## profHR profIT profLaw
## 0.7734850 0.9752184 0.6389856
## profmanage profMarketing profPR
## 0.3339704 0.4795226 0.4698244
## profSales profTeaching routefriends
## 0.6042950 0.6311156 1.0058055
## routejobsite_apply routejobsite_resume routerecommend
## 0.6681674 0.5209304 0.8257964
## wagewhite transcar transfoot
## 1.5283076 1.1793237 1.4369353
## score_selfctrl score_anx
## 1.0576865 1.0550492
Individuals working in the IT field tend to stay at their companies 2.2 times longer than those in finance. Similarly, individuals who apply for jobs through job sites or resumes generally have shorter tenures compared to those who secure positions through referrals from friends. Additionally, employees receiving fully taxed wages (white wages) tend to remain at their companies 52% longer than those receiving untaxed (gray) wages, indicating a longer duration of employment. Higher self-control scores are also associated with longer tenure. Interestingly, individuals who walk to work tend to stay 43.7% longer at their companies compared to those who primarily commute by bus.
In order to derive meaningful conclusions about employee retention with a robust handling of censored data that can serve as a predictive model, we can perform Cox proportional Hazard model (semi-parametric model).
cox_base_mod <- coxph(Surv(exp, event)~.,data = emp_df)
step(cox_base_mod)
## Start: AIC=6871.14
## Surv(exp, event) ~ self_gender + age + field + prof + route +
## coach + head_gender + wage + trans + score_extro + score_idpt +
## score_selfctrl + score_anx + score_innov
##
## Df AIC
## - score_innov 1 6869.1
## - score_idpt 1 6869.3
## - self_gender 1 6869.5
## - score_extro 1 6869.8
## - head_gender 1 6869.9
## - coach 2 6870.3
## - prof 14 6870.7
## - score_selfctrl 1 6871.1
## <none> 6871.1
## - score_anx 1 6872.0
## - trans 2 6876.6
## - age 1 6880.0
## - wage 1 6881.1
## - field 4 6892.5
## - route 4 6892.7
##
## Step: AIC=6869.15
## Surv(exp, event) ~ self_gender + age + field + prof + route +
## coach + head_gender + wage + trans + score_extro + score_idpt +
## score_selfctrl + score_anx
##
## Df AIC
## - score_idpt 1 6867.3
## - self_gender 1 6867.5
## - score_extro 1 6867.8
## - head_gender 1 6867.9
## - coach 2 6868.3
## - prof 14 6868.8
## <none> 6869.1
## - score_selfctrl 1 6869.4
## - score_anx 1 6870.1
## - trans 2 6874.6
## - age 1 6878.0
## - wage 1 6879.2
## - field 4 6890.6
## - route 4 6890.9
##
## Step: AIC=6867.26
## Surv(exp, event) ~ self_gender + age + field + prof + route +
## coach + head_gender + wage + trans + score_extro + score_selfctrl +
## score_anx
##
## Df AIC
## - self_gender 1 6865.6
## - head_gender 1 6866.1
## - coach 2 6866.5
## - score_extro 1 6866.5
## - prof 14 6866.8
## <none> 6867.3
## - score_selfctrl 1 6867.7
## - score_anx 1 6868.8
## - trans 2 6872.8
## - age 1 6876.0
## - wage 1 6877.4
## - field 4 6888.7
## - route 4 6889.0
##
## Step: AIC=6865.63
## Surv(exp, event) ~ age + field + prof + route + coach + head_gender +
## wage + trans + score_extro + score_selfctrl + score_anx
##
## Df AIC
## - head_gender 1 6864.4
## - coach 2 6864.8
## - score_extro 1 6865.0
## - prof 14 6865.0
## <none> 6865.6
## - score_selfctrl 1 6866.3
## - score_anx 1 6867.9
## - trans 2 6871.2
## - age 1 6874.4
## - wage 1 6876.1
## - route 4 6887.2
## - field 4 6887.7
##
## Step: AIC=6864.38
## Surv(exp, event) ~ age + field + prof + route + coach + wage +
## trans + score_extro + score_selfctrl + score_anx
##
## Df AIC
## - coach 2 6863.6
## - score_extro 1 6863.7
## <none> 6864.4
## - prof 14 6864.5
## - score_selfctrl 1 6865.2
## - score_anx 1 6867.1
## - trans 2 6870.1
## - age 1 6874.7
## - wage 1 6874.9
## - route 4 6885.3
## - field 4 6886.2
##
## Step: AIC=6863.64
## Surv(exp, event) ~ age + field + prof + route + wage + trans +
## score_extro + score_selfctrl + score_anx
##
## Df AIC
## - score_extro 1 6863.2
## <none> 6863.6
## - prof 14 6864.3
## - score_selfctrl 1 6864.6
## - score_anx 1 6866.1
## - trans 2 6869.0
## - wage 1 6874.5
## - age 1 6875.4
## - route 4 6883.6
## - field 4 6885.6
##
## Step: AIC=6863.18
## Surv(exp, event) ~ age + field + prof + route + wage + trans +
## score_selfctrl + score_anx
##
## Df AIC
## <none> 6863.2
## - prof 14 6863.8
## - score_anx 1 6867.1
## - trans 2 6868.6
## - score_selfctrl 1 6869.6
## - wage 1 6873.6
## - age 1 6873.8
## - route 4 6884.1
## - field 4 6885.1
## Call:
## coxph(formula = Surv(exp, event) ~ age + field + prof + route +
## wage + trans + score_selfctrl + score_anx, data = emp_df)
##
## coef exp(coef) se(coef) z p
## age 0.0229722 1.0232381 0.0063668 3.608 0.000308
## fieldfinance 0.3801051 1.4624382 0.1384227 2.746 0.006033
## fieldIT -0.5344071 0.5860166 0.1927055 -2.773 0.005551
## fieldmanufacture -0.1430499 0.8667108 0.1441057 -0.993 0.320869
## fieldretail -0.3322475 0.7173097 0.1145145 -2.901 0.003716
## profBusinessDevelopment 0.3706328 1.4486511 0.4894983 0.757 0.448949
## profCommercial 0.9380492 2.5549923 0.4931655 1.902 0.057158
## profConsult 0.8220196 2.2750899 0.4929325 1.668 0.095393
## profEngineer 0.8683370 2.3829446 0.5138028 1.690 0.091024
## profetc 0.5089503 1.6635441 0.4737691 1.074 0.282707
## profFinance 0.2101460 1.2338582 0.5103214 0.412 0.680492
## profHR 0.2733241 1.3143262 0.4198372 0.651 0.515031
## profIT 0.0326334 1.0331718 0.4667599 0.070 0.944261
## profLaw 0.4650626 1.5921138 0.6153673 0.756 0.449800
## profmanage 1.2247859 3.4034373 0.4910183 2.494 0.012618
## profMarketing 0.7951742 2.2148268 0.4713507 1.687 0.091601
## profPR 0.8544298 2.3500340 0.6193306 1.380 0.167709
## profSales 0.5113095 1.6674734 0.4512538 1.133 0.257178
## profTeaching 0.5447424 1.7241642 0.5186051 1.050 0.293535
## routefriends -0.0004327 0.9995674 0.2156620 -0.002 0.998399
## routejobsite_apply 0.4623123 1.5877411 0.1731386 2.670 0.007581
## routejobsite_resume 0.7291377 2.0732920 0.1764617 4.132 3.6e-05
## routerecommend 0.2287965 1.2570862 0.1632106 1.402 0.160961
## wagewhite -0.4886619 0.6134467 0.1315418 -3.715 0.000203
## transcar -0.2026201 0.8165884 0.0990296 -2.046 0.040751
## transfoot -0.4316186 0.6494570 0.1675262 -2.576 0.009983
## score_selfctrl -0.0635267 0.9384491 0.0219958 -2.888 0.003875
## score_anx -0.0618985 0.9399783 0.0255270 -2.425 0.015316
##
## Likelihood ratio test=133.4 on 28 df, p=1.098e-15
## n= 1129, number of events= 571
best_cox_mod <- coxph(formula = Surv(exp, event) ~ age + field + prof + route +
wage + trans + score_selfctrl + score_anx, data = emp_df)
Since we assumes that the hazard ratios are proportional over time, not assuming a specific form for the baseline hazard function, we will check for proportional hazard assumptions of this model as below:
# PH-test
ph_test <- cox.zph(best_cox_mod)
print(ph_test)
## chisq df p
## age 0.845 1 0.35798
## field 2.546 4 0.63648
## prof 40.331 14 0.00023
## route 9.072 4 0.05932
## wage 0.778 1 0.37768
## trans 2.270 2 0.32144
## score_selfctrl 0.494 1 0.48200
## score_anx 1.192 1 0.27489
## GLOBAL 55.074 28 0.00167
plot(ph_test)
# Stratify the model by the covariate that violates the assumption
cox_model_strat <- coxph(Surv(exp, event) ~ age + field + strata(prof) + route +
wage + trans + score_selfctrl + score_anx, data = emp_df)
summary(cox_model_strat)
## Call:
## coxph(formula = Surv(exp, event) ~ age + field + strata(prof) +
## route + wage + trans + score_selfctrl + score_anx, data = emp_df)
##
## n= 1129, number of events= 571
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.023977 1.024266 0.006529 3.672 0.000240 ***
## fieldfinance 0.351325 1.420950 0.139990 2.510 0.012085 *
## fieldIT -0.517867 0.595790 0.195140 -2.654 0.007959 **
## fieldmanufacture -0.200800 0.818076 0.147234 -1.364 0.172626
## fieldretail -0.314352 0.730262 0.115144 -2.730 0.006332 **
## routefriends -0.037390 0.963301 0.225313 -0.166 0.868199
## routejobsite_apply 0.433081 1.542001 0.177583 2.439 0.014738 *
## routejobsite_resume 0.682893 1.979596 0.180117 3.791 0.000150 ***
## routerecommend 0.249538 1.283432 0.168212 1.483 0.137948
## wagewhite -0.469301 0.625440 0.133487 -3.516 0.000439 ***
## transcar -0.171671 0.842257 0.100130 -1.714 0.086443 .
## transfoot -0.421441 0.656100 0.176033 -2.394 0.016661 *
## score_selfctrl -0.064877 0.937183 0.022436 -2.892 0.003833 **
## score_anx -0.060366 0.941420 0.025798 -2.340 0.019285 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0243 0.9763 1.0112 1.0375
## fieldfinance 1.4209 0.7038 1.0800 1.8696
## fieldIT 0.5958 1.6784 0.4064 0.8734
## fieldmanufacture 0.8181 1.2224 0.6130 1.0917
## fieldretail 0.7303 1.3694 0.5827 0.9151
## routefriends 0.9633 1.0381 0.6194 1.4981
## routejobsite_apply 1.5420 0.6485 1.0887 2.1840
## routejobsite_resume 1.9796 0.5052 1.3908 2.8177
## routerecommend 1.2834 0.7792 0.9230 1.7847
## wagewhite 0.6254 1.5989 0.4815 0.8125
## transcar 0.8423 1.1873 0.6922 1.0249
## transfoot 0.6561 1.5242 0.4647 0.9264
## score_selfctrl 0.9372 1.0670 0.8969 0.9793
## score_anx 0.9414 1.0622 0.8950 0.9902
##
## Concordance= 0.635 (se = 0.016 )
## Likelihood ratio test= 96.17 on 14 df, p=3e-14
## Wald test = 99.54 on 14 df, p=6e-15
## Score (logrank) test = 101.5 on 14 df, p=2e-15
After applying stratify for the prof variables, we have the proportional hazard assumption holds for the model.
# Test the proportional hazards assumption
ph_test <- cox.zph(cox_model_strat)
# Print the test results
print(ph_test)
## chisq df p
## age 2.48e-01 1 0.619
## field 1.04e+00 4 0.903
## route 8.47e+00 4 0.076
## wage 2.33e-01 1 0.629
## trans 1.06e+00 2 0.589
## score_selfctrl 4.82e-05 1 0.994
## score_anx 6.95e-01 1 0.404
## GLOBAL 1.30e+01 14 0.524
# Plot the Schoenfeld residuals for each covariate
plot(ph_test)
We have the following insights for the aboeve Cox proportional hazard model: + Employees hired through jobsite applications have a 54.2% higher risk of quitting compared to those from other routes. This may suggest that employees who apply online might be less committed or have more options + Employees hired by submitting resumes to job sites face almost double the risk of quitting compared to other sources, likely due to higher mobility or availability of alternative job offers + Employees receiving white (tax-authorized) wages have a 37.5% lower risk of quitting compared to those paid under a grey wage system. This suggests that legal, tax-compliant wage structures contribute to better retention, likely due to higher job security and benefits. + Employees who commute on foot have a 34.4% lower risk of quitting compared to bus commuters. This could suggest that employees living closer to work may have greater job stability or satisfaction. + For each unit increase in self-control, the risk of quitting decreases by 6.3%. Employees with higher self-control are likely to stay longer, which could be due to their ability to manage work stress and maintain focus on long-term goals. + For each unit increase in anxiety, the risk of quitting decreases by 5.9%. This counterintuitive result suggests that more anxious employees might stay longer, possibly due to a fear of change or job loss.
# Define function for customizing the output for Cox proportional hazard model's results
output_custom <- function(model, exp = FALSE){
# Extract coefficients
coeffs <- summary(model)$coefficients
output_df <- data.frame(
estimate = coeffs[,1],
std_err = coeffs[,3],
p_val = coeffs[,5],
significant = coeffs[,5] <= 0.05
)
output_df[,1:3] <- round(output_df[,1:3], 3)
if (exp == TRUE){
output_df$exp_est <- round(coeffs[,2], 3)
output_df <- output_df %>% relocate(exp_est, .after = estimate)
}
return (output_df)
}
output_custom(cox_model_strat, exp = TRUE)
## estimate exp_est std_err p_val significant
## age 0.024 1.024 0.007 0.000 TRUE
## fieldfinance 0.351 1.421 0.140 0.012 TRUE
## fieldIT -0.518 0.596 0.195 0.008 TRUE
## fieldmanufacture -0.201 0.818 0.147 0.173 FALSE
## fieldretail -0.314 0.730 0.115 0.006 TRUE
## routefriends -0.037 0.963 0.225 0.868 FALSE
## routejobsite_apply 0.433 1.542 0.178 0.015 TRUE
## routejobsite_resume 0.683 1.980 0.180 0.000 TRUE
## routerecommend 0.250 1.283 0.168 0.138 FALSE
## wagewhite -0.469 0.625 0.133 0.000 TRUE
## transcar -0.172 0.842 0.100 0.086 FALSE
## transfoot -0.421 0.656 0.176 0.017 TRUE
## score_selfctrl -0.065 0.937 0.022 0.004 TRUE
## score_anx -0.060 0.941 0.026 0.019 TRUE
# Calculate the concordance index
c_index_cox <- concordance(cox_model_strat)
print(c_index_cox$concordance)
## [1] 0.6352516
cox_model_strat_final <- coxph(Surv(exp, event) ~ age + field + strata(prof) + route +
coach + head_gender + wage + trans + score_selfctrl + score_anx +score_anx, data = emp_df)
summary(cox_model_strat_final)
## Call:
## coxph(formula = Surv(exp, event) ~ age + field + strata(prof) +
## route + coach + head_gender + wage + trans + score_selfctrl +
## score_anx + score_anx, data = emp_df)
##
## n= 1129, number of events= 571
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age 0.022247 1.022496 0.006919 3.215 0.001304 **
## fieldfinance 0.335279 1.398331 0.141756 2.365 0.018021 *
## fieldIT -0.541747 0.581731 0.195788 -2.767 0.005657 **
## fieldmanufacture -0.200710 0.818150 0.149028 -1.347 0.178048
## fieldretail -0.323286 0.723767 0.115266 -2.805 0.005036 **
## routefriends -0.034195 0.966383 0.226005 -0.151 0.879738
## routejobsite_apply 0.442925 1.557255 0.178815 2.477 0.013249 *
## routejobsite_resume 0.713682 2.041494 0.181911 3.923 8.74e-05 ***
## routerecommend 0.248957 1.282687 0.168657 1.476 0.139912
## coachmy head -0.081567 0.921671 0.110288 -0.740 0.459554
## coachyes 0.168370 1.183375 0.139315 1.209 0.226833
## head_genderm 0.129126 1.137834 0.098767 1.307 0.191084
## wagewhite -0.466276 0.627334 0.134182 -3.475 0.000511 ***
## transcar -0.174091 0.840220 0.100448 -1.733 0.083070 .
## transfoot -0.428388 0.651558 0.176444 -2.428 0.015186 *
## score_selfctrl -0.061366 0.940479 0.022521 -2.725 0.006434 **
## score_anx -0.057210 0.944396 0.026122 -2.190 0.028518 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age 1.0225 0.9780 1.0087 1.0365
## fieldfinance 1.3983 0.7151 1.0591 1.8462
## fieldIT 0.5817 1.7190 0.3963 0.8538
## fieldmanufacture 0.8181 1.2223 0.6109 1.0957
## fieldretail 0.7238 1.3817 0.5774 0.9072
## routefriends 0.9664 1.0348 0.6205 1.5050
## routejobsite_apply 1.5573 0.6422 1.0969 2.2109
## routejobsite_resume 2.0415 0.4898 1.4292 2.9160
## routerecommend 1.2827 0.7796 0.9216 1.7852
## coachmy head 0.9217 1.0850 0.7425 1.1441
## coachyes 1.1834 0.8450 0.9006 1.5549
## head_genderm 1.1378 0.8789 0.9376 1.3809
## wagewhite 0.6273 1.5940 0.4823 0.8160
## transcar 0.8402 1.1902 0.6901 1.0230
## transfoot 0.6516 1.5348 0.4611 0.9208
## score_selfctrl 0.9405 1.0633 0.8999 0.9829
## score_anx 0.9444 1.0589 0.8973 0.9940
##
## Concordance= 0.641 (se = 0.015 )
## Likelihood ratio test= 100.7 on 17 df, p=7e-14
## Wald test = 104.3 on 17 df, p=1e-14
## Score (logrank) test = 106.2 on 17 df, p=6e-15
With a C-index of 0.635 indicates the Cox proportional hazard model is better than random but has moderate predictive ability. Therefore, we can consider the non-linear model such as Random Forest Survival models and Gradient Boosting Survival models.
library(tools4uplift)
## Warning: package 'tools4uplift' was built under R version 4.3.3
Based on the above results, since we confirm that there’s difference in terms of retention rate with the log-rank test for the following variable: - Industry Sector (field) - Profession (prof) - Source came to Company (route) - Wage tax authorized or not (wage) - Mode of Transportation (trans)
We will conduct quasi-experimental analysis for each of those factors (treat each of those variables as potential treatment-like interventions that might affect employee retention)
We have a hypothesis that different industries may respond differently to retention strategies like training or employee engagement program with better communication.
We will test if employees in retail or finance sectors have better retention when given more flexible schedules or additional professional development.
temp <- data.frame(emp_df)
temp$treatment <- ifelse(temp$field %in% c('finance', 'IT'), 1, 0)
head(temp)
## exp event self_gender age field prof route coach
## 1 7.030801 1 m 35 finance HR recommend no
## 2 22.965092 1 m 33 finance HR jobsite_resume no
## 3 15.934292 1 f 35 others HR recommend no
## 4 15.934292 1 f 35 others HR recommend no
## 5 8.410678 1 m 32 retail Commercial jobsite_apply yes
## 6 8.969199 1 f 42 manufacture HR jobsite_resume yes
## head_gender wage trans score_extro score_idpt score_selfctrl score_anx
## 1 f white bus 6.2 4.1 5.7 7.1
## 2 m white bus 6.2 4.1 5.7 7.1
## 3 m white bus 6.2 6.2 2.6 4.8
## 4 m white bus 5.4 7.6 4.9 2.5
## 5 f white bus 3.0 4.1 8.0 7.1
## 6 m white bus 6.2 6.2 4.1 5.6
## score_innov treatment
## 1 8.3 1
## 2 8.3 1
## 3 8.3 0
## 4 6.7 0
## 5 3.7 0
## 6 6.7 0
table(emp_df$route)
##
## others friends jobsite_apply jobsite_resume recommend
## 100 118 318 248 345
# Recreate emp_df_copy to avoid modifying the original data
emp_df_copy <- temp[, c("treatment", "event", "route", "wage", "trans", "score_selfctrl", "score_anx")]
# Ensure factor levels in the dataset are consistent
emp_df_copy$route <- factor(emp_df_copy$route)
emp_df_copy$trans <- factor(emp_df_copy$trans)
# Fit the two-model uplift regression
uplift_model <- DualUplift(emp_df_copy, treat = "treatment", outcome = "event",
predictors = c("route", "wage", "trans", "score_selfctrl", "score_anx"))
print(uplift_model)
## $call
## DualUplift(data = emp_df_copy, treat = "treatment", outcome = "event",
## predictors = c("route", "wage", "trans", "score_selfctrl",
## "score_anx"))
##
## $control
##
## Call: "Control Group Logistic Model"
##
## Coefficients:
## (Intercept) routefriends routejobsite_apply
## 1.232908 -0.529325 -0.289496
## routejobsite_resume routerecommend wagewhite
## -0.009535 0.302337 -0.239487
## transcar transfoot score_selfctrl
## -0.055595 -0.558426 -0.070139
## score_anx
## -0.075524
##
## Degrees of Freedom: 892 Total (i.e. Null); 883 Residual
## Null Deviance: 1237
## Residual Deviance: 1205 AIC: 1225
##
## $treatment
##
## Call: "Treatment Group Logistic Model"
##
## Coefficients:
## (Intercept) routefriends routejobsite_apply
## 0.01622 -1.46029 -0.13893
## routejobsite_resume routerecommend wagewhite
## 0.41459 0.83327 -0.12025
## transcar transfoot score_selfctrl
## -0.04992 -0.79294 0.07835
## score_anx
## -0.10638
##
## Degrees of Freedom: 235 Total (i.e. Null); 226 Residual
## Null Deviance: 325.8
## Residual Deviance: 300.9 AIC: 320.9
##
## attr(,"class")
## [1] "print.DualUplift"
summary(uplift_model)
## $call
## DualUplift(data = emp_df_copy, treat = "treatment", outcome = "event",
## predictors = c("route", "wage", "trans", "score_selfctrl",
## "score_anx"))
##
## $control
##
## Call:
## "Control Group Logistic Model"
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.232908 0.437729 2.817 0.00485 **
## routefriends -0.529325 0.310754 -1.703 0.08850 .
## routejobsite_apply -0.289496 0.263785 -1.097 0.27244
## routejobsite_resume -0.009535 0.272435 -0.035 0.97208
## routerecommend 0.302337 0.260759 1.159 0.24627
## wagewhite -0.239487 0.211319 -1.133 0.25709
## transcar -0.055595 0.153640 -0.362 0.71746
## transfoot -0.558426 0.240002 -2.327 0.01998 *
## score_selfctrl -0.070139 0.035701 -1.965 0.04946 *
## score_anx -0.075524 0.041371 -1.826 0.06792 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1236.9 on 892 degrees of freedom
## Residual deviance: 1205.0 on 883 degrees of freedom
## AIC: 1225
##
## Number of Fisher Scoring iterations: 4
##
##
## $treatment
##
## Call:
## "Treatment Group Logistic Model"
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.01622 0.95179 0.017 0.9864
## routefriends -1.46029 0.83021 -1.759 0.0786 .
## routejobsite_apply -0.13893 0.60133 -0.231 0.8173
## routejobsite_resume 0.41459 0.61215 0.677 0.4982
## routerecommend 0.83327 0.57672 1.445 0.1485
## wagewhite -0.12025 0.52693 -0.228 0.8195
## transcar -0.04992 0.34758 -0.144 0.8858
## transfoot -0.79294 0.51682 -1.534 0.1250
## score_selfctrl 0.07835 0.07087 1.105 0.2690
## score_anx -0.10638 0.07914 -1.344 0.1789
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 325.79 on 235 degrees of freedom
## Residual deviance: 300.88 on 226 degrees of freedom
## AIC: 320.88
##
## Number of Fisher Scoring iterations: 4
##
##
## attr(,"class")
## [1] "summary.DualUplift"
Insights based on this: transfoot: Employees commuting by foot are significantly more likely to quit in both control and treatment groups. Consider offering support like transportation benefits or flexible working arrangements. score_selfctrl: Low self-control is linked to higher quit rates, especially in the control group. Implement mentoring, goal-setting, or time management programs.
We have a hypothesis that employees in different professions may have different retention risks due to job roles, job market demand, or career development opportunities.
We will test if employees in high demand professions versus others to identify where retention strategies might have most uplift.
# Create grouped 'prof' variable
emp_df$prof_group <- with(emp_df, ifelse(prof %in% c("BusinessDevelopment", "manage", "Consult"), "Management & Leadership",
ifelse(prof %in% c("Engineer", "IT"), "Technical & Engineering",
ifelse(prof %in% c("Sales", "Marketing", "Commercial", "PR"), "Sales, Marketing & Commercial",
ifelse(prof %in% c("Accounting", "Finance", "Law"), "Finance & Legal",
ifelse(prof == "HR", "HR & Administration",
ifelse(prof == "Teaching", "Specialized Roles", "Other")))))))
table(emp_df$prof_group)
##
## Finance & Legal HR & Administration
## 34 757
## Management & Leadership Other
## 74 37
## Sales, Marketing & Commercial Specialized Roles
## 126 12
## Technical & Engineering
## 89
temp <- data.frame(emp_df)
temp$treatment <- ifelse(temp$prof_group %in% c("Technical & Engineering"), 1, 0)
head(temp)
## exp event self_gender age field prof route coach
## 1 7.030801 1 m 35 finance HR recommend no
## 2 22.965092 1 m 33 finance HR jobsite_resume no
## 3 15.934292 1 f 35 others HR recommend no
## 4 15.934292 1 f 35 others HR recommend no
## 5 8.410678 1 m 32 retail Commercial jobsite_apply yes
## 6 8.969199 1 f 42 manufacture HR jobsite_resume yes
## head_gender wage trans score_extro score_idpt score_selfctrl score_anx
## 1 f white bus 6.2 4.1 5.7 7.1
## 2 m white bus 6.2 4.1 5.7 7.1
## 3 m white bus 6.2 6.2 2.6 4.8
## 4 m white bus 5.4 7.6 4.9 2.5
## 5 f white bus 3.0 4.1 8.0 7.1
## 6 m white bus 6.2 6.2 4.1 5.6
## score_innov prof_group treatment
## 1 8.3 HR & Administration 0
## 2 8.3 HR & Administration 0
## 3 8.3 HR & Administration 0
## 4 6.7 HR & Administration 0
## 5 3.7 Sales, Marketing & Commercial 0
## 6 6.7 HR & Administration 0
# Recreate emp_df_copy to avoid modifying the original data
temp_sub <- temp[, c("treatment", "event", "route", "wage", "trans", "score_selfctrl", "score_anx")]
# Fit the two-model uplift regression
uplift_model <- DualUplift(temp_sub, treat = "treatment", outcome = "event",
predictors = c("route", "wage", "trans", "score_selfctrl", "score_anx"))
print(uplift_model)
## $call
## DualUplift(data = temp_sub, treat = "treatment", outcome = "event",
## predictors = c("route", "wage", "trans", "score_selfctrl",
## "score_anx"))
##
## $control
##
## Call: "Control Group Logistic Model"
##
## Coefficients:
## (Intercept) routefriends routejobsite_apply
## 1.03269 -0.49600 -0.24826
## routejobsite_resume routerecommend wagewhite
## 0.11722 0.40538 -0.29556
## transcar transfoot score_selfctrl
## 0.03696 -0.49857 -0.03805
## score_anx
## -0.08704
##
## Degrees of Freedom: 1039 Total (i.e. Null); 1030 Residual
## Null Deviance: 1441
## Residual Deviance: 1401 AIC: 1421
##
## $treatment
##
## Call: "Treatment Group Logistic Model"
##
## Coefficients:
## (Intercept) routefriends routejobsite_apply
## -18.68276 15.03385 16.38026
## routejobsite_resume routerecommend wagewhite
## 16.18549 17.32853 0.40742
## transcar transfoot score_selfctrl
## -0.99190 -2.88253 0.00728
## score_anx
## 0.30835
##
## Degrees of Freedom: 88 Total (i.e. Null); 79 Residual
## Null Deviance: 120.1
## Residual Deviance: 98.93 AIC: 118.9
##
## attr(,"class")
## [1] "print.DualUplift"
summary(uplift_model)
## $call
## DualUplift(data = temp_sub, treat = "treatment", outcome = "event",
## predictors = c("route", "wage", "trans", "score_selfctrl",
## "score_anx"))
##
## $control
##
## Call:
## "Control Group Logistic Model"
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.03269 0.40116 2.574 0.0100 *
## routefriends -0.49600 0.29132 -1.703 0.0886 .
## routejobsite_apply -0.24826 0.24084 -1.031 0.3026
## routejobsite_resume 0.11722 0.24986 0.469 0.6390
## routerecommend 0.40538 0.23685 1.712 0.0870 .
## wagewhite -0.29556 0.20283 -1.457 0.1451
## transcar 0.03696 0.14485 0.255 0.7986
## transfoot -0.49857 0.22227 -2.243 0.0249 *
## score_selfctrl -0.03805 0.03267 -1.165 0.2442
## score_anx -0.08704 0.03792 -2.296 0.0217 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1440.9 on 1039 degrees of freedom
## Residual deviance: 1401.3 on 1030 degrees of freedom
## AIC: 1421.3
##
## Number of Fisher Scoring iterations: 4
##
##
## $treatment
##
## Call:
## "Treatment Group Logistic Model"
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.68276 1295.38414 -0.014 0.9885
## routefriends 15.03385 1295.38305 0.012 0.9907
## routejobsite_apply 16.38026 1295.38292 0.013 0.9899
## routejobsite_resume 16.18549 1295.38289 0.012 0.9900
## routerecommend 17.32853 1295.38295 0.013 0.9893
## wagewhite 0.40742 0.83845 0.486 0.6270
## transcar -0.99190 0.55955 -1.773 0.0763 .
## transfoot -2.88253 1.17639 -2.450 0.0143 *
## score_selfctrl 0.00728 0.11667 0.062 0.9502
## score_anx 0.30834 0.16689 1.848 0.0647 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 120.113 on 88 degrees of freedom
## Residual deviance: 98.931 on 79 degrees of freedom
## AIC: 118.93
##
## Number of Fisher Scoring iterations: 15
##
##
## attr(,"class")
## [1] "summary.DualUplift"
Control group:
Employees with higher anxiety are more likely to quit.
Those who walk to work are more likely to quit.
Referral and friendship-based hires show some indication of lower turnover, though not strongly significant.
Treatment group:
Transportation (walking and driving) compared to go by bus seems to have a strong effect on turnover for IT & Engineering professionals.
Anxiety remains a factor in predicting quitting behavior, with marginal significance.
Insights: + Anxiety is a significant factor in both control and treatment groups. Consider interventions like mental health support or stress management programs for employees with high anxiety scores. + In the treatment group, employees who drive to work seem to have lower turnover. You may want to target them for special retention programs or further incentives as they seem less likely to quit. + The control group shows that hiring through friends or recommendations might lead to lower turnover. While the treatment group results for route variables are inconclusive, consider revisiting how IT & Engineering employees are sourced. Referral programs might yield better long-term retention.
chosen_factors = c("field", "prof", "route", "wage", "trans")
unique_val_lst <- list()
for (col in chosen_factors){
unique_val_lst[[col]] = unique(emp_df[col])
}
for (col in names(unique_val_lst)) {
print(unique_val_lst[[col]])
}
## field
## 1 finance
## 3 others
## 5 retail
## 6 manufacture
## 18 IT
## prof
## 1 HR
## 5 Commercial
## 10 Marketing
## 14 etc
## 18 Sales
## 22 BusinessDevelopment
## 23 Finance
## 53 Teaching
## 68 manage
## 109 IT
## 121 Law
## 135 Consult
## 138 Engineer
## 279 PR
## 356 Accounting
## route
## 1 recommend
## 2 jobsite_resume
## 5 jobsite_apply
## 14 others
## 313 friends
## wage
## 1 white
## 19 grey
## trans
## 1 bus
## 11 car
## 13 foot
We hypothesize that the source through which an employee was recruited might affect their retention. For example, employees recruited through referrals may have a stronger bond with the company than those recruited via job sites. Therefore, we use recruitment source with more focusing on referral as a treatment-like variable to target specific interventions to employees from less effective pipelines.
temp$treatment <- ifelse(temp$route %in% c("friends", "recommend"), 1, 0)
head(temp)
## exp event self_gender age field prof route coach
## 1 7.030801 1 m 35 finance HR recommend no
## 2 22.965092 1 m 33 finance HR jobsite_resume no
## 3 15.934292 1 f 35 others HR recommend no
## 4 15.934292 1 f 35 others HR recommend no
## 5 8.410678 1 m 32 retail Commercial jobsite_apply yes
## 6 8.969199 1 f 42 manufacture HR jobsite_resume yes
## head_gender wage trans score_extro score_idpt score_selfctrl score_anx
## 1 f white bus 6.2 4.1 5.7 7.1
## 2 m white bus 6.2 4.1 5.7 7.1
## 3 m white bus 6.2 6.2 2.6 4.8
## 4 m white bus 5.4 7.6 4.9 2.5
## 5 f white bus 3.0 4.1 8.0 7.1
## 6 m white bus 6.2 6.2 4.1 5.6
## score_innov prof_group treatment
## 1 8.3 HR & Administration 1
## 2 8.3 HR & Administration 0
## 3 8.3 HR & Administration 1
## 4 6.7 HR & Administration 1
## 5 3.7 Sales, Marketing & Commercial 0
## 6 6.7 HR & Administration 0
# Recreate emp_df_copy to avoid modifying the original data
temp_sub <- temp[, c("treatment", "event", "prof_group", "wage", "trans", "score_selfctrl", "score_anx")]
# Fit the two-model uplift regression
uplift_model <- DualUplift(temp_sub, treat = "treatment", outcome = "event",
predictors = c("prof_group", "wage", "trans", "score_selfctrl", "score_anx"))
print(uplift_model)
## $call
## DualUplift(data = temp_sub, treat = "treatment", outcome = "event",
## predictors = c("prof_group", "wage", "trans", "score_selfctrl",
## "score_anx"))
##
## $control
##
## Call: "Control Group Logistic Model"
##
## Coefficients:
## (Intercept)
## 0.64920
## prof_groupHR & Administration
## -0.15084
## prof_groupManagement & Leadership
## 0.73692
## prof_groupOther
## 0.07959
## prof_groupSales, Marketing & Commercial
## 0.60613
## prof_groupSpecialized Roles
## 14.67774
## prof_groupTechnical & Engineering
## -0.41039
## wagewhite
## -0.15335
## transcar
## -0.21582
## transfoot
## -0.80037
## score_selfctrl
## -0.02237
## score_anx
## -0.05317
##
## Degrees of Freedom: 665 Total (i.e. Null); 654 Residual
## Null Deviance: 922.3
## Residual Deviance: 893.7 AIC: 917.7
##
## $treatment
##
## Call: "Treatment Group Logistic Model"
##
## Coefficients:
## (Intercept)
## 2.33796
## prof_groupHR & Administration
## -0.97177
## prof_groupManagement & Leadership
## -0.36595
## prof_groupOther
## -0.37652
## prof_groupSales, Marketing & Commercial
## -0.62287
## prof_groupSpecialized Roles
## 14.55454
## prof_groupTechnical & Engineering
## -1.09251
## wagewhite
## -0.43483
## transcar
## 0.25725
## transfoot
## -0.54997
## score_selfctrl
## -0.07255
## score_anx
## -0.10988
##
## Degrees of Freedom: 462 Total (i.e. Null); 451 Residual
## Null Deviance: 638.6
## Residual Deviance: 605.2 AIC: 629.2
##
## attr(,"class")
## [1] "print.DualUplift"
summary(uplift_model)
## $call
## DualUplift(data = temp_sub, treat = "treatment", outcome = "event",
## predictors = c("prof_group", "wage", "trans", "score_selfctrl",
## "score_anx"))
##
## $control
##
## Call:
## "Control Group Logistic Model"
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.64920 0.85353 0.761 0.44690
## prof_groupHR & Administration -0.15084 0.71712 -0.210 0.83340
## prof_groupManagement & Leadership 0.73692 0.80484 0.916 0.35987
## prof_groupOther 0.07959 0.82872 0.096 0.92349
## prof_groupSales, Marketing & Commercial 0.60613 0.75616 0.802 0.42279
## prof_groupSpecialized Roles 14.67774 499.14355 0.029 0.97654
## prof_groupTechnical & Engineering -0.41039 0.77131 -0.532 0.59468
## wagewhite -0.15335 0.23847 -0.643 0.52018
## transcar -0.21582 0.18495 -1.167 0.24324
## transfoot -0.80037 0.28325 -2.826 0.00472
## score_selfctrl -0.02237 0.04030 -0.555 0.57878
## score_anx -0.05317 0.04769 -1.115 0.26488
##
## (Intercept)
## prof_groupHR & Administration
## prof_groupManagement & Leadership
## prof_groupOther
## prof_groupSales, Marketing & Commercial
## prof_groupSpecialized Roles
## prof_groupTechnical & Engineering
## wagewhite
## transcar
## transfoot **
## score_selfctrl
## score_anx
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 922.26 on 665 degrees of freedom
## Residual deviance: 893.65 on 654 degrees of freedom
## AIC: 917.65
##
## Number of Fisher Scoring iterations: 13
##
##
## $treatment
##
## Call:
## "Treatment Group Logistic Model"
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.33796 0.76770 3.045 0.00232
## prof_groupHR & Administration -0.97177 0.46775 -2.078 0.03775
## prof_groupManagement & Leadership -0.36595 0.55536 -0.659 0.50993
## prof_groupOther -0.37652 0.72293 -0.521 0.60250
## prof_groupSales, Marketing & Commercial -0.62287 0.53398 -1.166 0.24343
## prof_groupSpecialized Roles 14.55454 477.12566 0.031 0.97566
## prof_groupTechnical & Engineering -1.09251 0.55299 -1.976 0.04819
## wagewhite -0.43483 0.35997 -1.208 0.22706
## transcar 0.25725 0.21271 1.209 0.22650
## transfoot -0.54997 0.34282 -1.604 0.10866
## score_selfctrl -0.07255 0.04982 -1.456 0.14534
## score_anx -0.10988 0.05867 -1.873 0.06111
##
## (Intercept) **
## prof_groupHR & Administration *
## prof_groupManagement & Leadership
## prof_groupOther
## prof_groupSales, Marketing & Commercial
## prof_groupSpecialized Roles
## prof_groupTechnical & Engineering *
## wagewhite
## transcar
## transfoot
## score_selfctrl
## score_anx .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 638.57 on 462 degrees of freedom
## Residual deviance: 605.23 on 451 degrees of freedom
## AIC: 629.23
##
## Number of Fisher Scoring iterations: 14
##
##
## attr(,"class")
## [1] "summary.DualUplift"
Control Group:
Walking to work is the most significant predictor of turnover in the control group.
Treatment Group:
Referral-based hires in HR & Administration and Technical & Engineering roles are less likely to quit, suggesting that referrals work well for these professions.
Anxiety may increase the likelihood of quitting, even for referred employees. Insights:
The model shows that HR & Administration and Technical & Engineering employees who were referred are less likely to quit. This suggests that referral programs are particularly effective for these roles. Strengthen and promote referral incentives for these professions to further enhance retention.
For other professions like Sales & Marketing or Management, the referral system may not be as effective
Walking to work is consistently associated with higher turnover in both the control and treatment groups
Anxiety is a marginally significant predictor of quitting, particularly in the treatment group.
We think that employees who use more convenient or subsidized transportation may have higher retention due to reduced commuting stress. Thus, we want to assess the impact of interventions like offering transporation benefits on employees with different commuting modes.
table(emp_df$trans)
##
## bus car foot
## 681 331 117
temp$treatment <- ifelse(temp$trans %in% c("bus"), 1, 0)
head(temp)
## exp event self_gender age field prof route coach
## 1 7.030801 1 m 35 finance HR recommend no
## 2 22.965092 1 m 33 finance HR jobsite_resume no
## 3 15.934292 1 f 35 others HR recommend no
## 4 15.934292 1 f 35 others HR recommend no
## 5 8.410678 1 m 32 retail Commercial jobsite_apply yes
## 6 8.969199 1 f 42 manufacture HR jobsite_resume yes
## head_gender wage trans score_extro score_idpt score_selfctrl score_anx
## 1 f white bus 6.2 4.1 5.7 7.1
## 2 m white bus 6.2 4.1 5.7 7.1
## 3 m white bus 6.2 6.2 2.6 4.8
## 4 m white bus 5.4 7.6 4.9 2.5
## 5 f white bus 3.0 4.1 8.0 7.1
## 6 m white bus 6.2 6.2 4.1 5.6
## score_innov prof_group treatment
## 1 8.3 HR & Administration 1
## 2 8.3 HR & Administration 1
## 3 8.3 HR & Administration 1
## 4 6.7 HR & Administration 1
## 5 3.7 Sales, Marketing & Commercial 1
## 6 6.7 HR & Administration 1
# Recreate emp_df_copy to avoid modifying the original data
temp_sub <- temp[, c("treatment", "event", "prof_group", "route", "wage", "score_selfctrl", "score_anx")]
# Fit the two-model uplift regression
uplift_model <- DualUplift(temp_sub, treat = "treatment", outcome = "event",
predictors = c("prof_group", "route", "wage", "score_selfctrl", "score_anx"))
print(uplift_model)
## $call
## DualUplift(data = temp_sub, treat = "treatment", outcome = "event",
## predictors = c("prof_group", "route", "wage", "score_selfctrl",
## "score_anx"))
##
## $control
##
## Call: "Control Group Logistic Model"
##
## Coefficients:
## (Intercept)
## 1.90918
## prof_groupHR & Administration
## -0.54507
## prof_groupManagement & Leadership
## -0.11562
## prof_groupOther
## -0.32317
## prof_groupSales, Marketing & Commercial
## -0.48353
## prof_groupSpecialized Roles
## 13.47259
## prof_groupTechnical & Engineering
## -1.41488
## routefriends
## 0.16616
## routejobsite_apply
## -0.09732
## routejobsite_resume
## 0.08140
## routerecommend
## 0.49836
## wagewhite
## -0.04847
## score_selfctrl
## -0.14453
## score_anx
## -0.13492
##
## Degrees of Freedom: 447 Total (i.e. Null); 434 Residual
## Null Deviance: 620.6
## Residual Deviance: 586.1 AIC: 614.1
##
## $treatment
##
## Call: "Treatment Group Logistic Model"
##
## Coefficients:
## (Intercept)
## 1.10950
## prof_groupHR & Administration
## -1.05165
## prof_groupManagement & Leadership
## -0.01037
## prof_groupOther
## -0.79085
## prof_groupSales, Marketing & Commercial
## -0.01305
## prof_groupSpecialized Roles
## 14.35006
## prof_groupTechnical & Engineering
## -0.85172
## routefriends
## -0.75990
## routejobsite_apply
## 0.08910
## routejobsite_resume
## 0.52726
## routerecommend
## 0.62512
## wagewhite
## -0.34236
## score_selfctrl
## 0.02366
## score_anx
## -0.05141
##
## Degrees of Freedom: 680 Total (i.e. Null); 667 Residual
## Null Deviance: 943
## Residual Deviance: 880 AIC: 908
##
## attr(,"class")
## [1] "print.DualUplift"
summary(uplift_model)
## $call
## DualUplift(data = temp_sub, treat = "treatment", outcome = "event",
## predictors = c("prof_group", "route", "wage", "score_selfctrl",
## "score_anx"))
##
## $control
##
## Call:
## "Control Group Logistic Model"
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.90918 0.83800 2.278 0.02271
## prof_groupHR & Administration -0.54507 0.54197 -1.006 0.31455
## prof_groupManagement & Leadership -0.11562 0.64193 -0.180 0.85707
## prof_groupOther -0.32317 0.72894 -0.443 0.65752
## prof_groupSales, Marketing & Commercial -0.48353 0.59222 -0.816 0.41423
## prof_groupSpecialized Roles 13.47259 508.16307 0.027 0.97885
## prof_groupTechnical & Engineering -1.41488 0.64658 -2.188 0.02865
## routefriends 0.16616 0.41949 0.396 0.69204
## routejobsite_apply -0.09732 0.34785 -0.280 0.77966
## routejobsite_resume 0.08140 0.36697 0.222 0.82446
## routerecommend 0.49836 0.31714 1.571 0.11609
## wagewhite -0.04847 0.33670 -0.144 0.88553
## score_selfctrl -0.14453 0.05171 -2.795 0.00519
## score_anx -0.13492 0.06301 -2.141 0.03226
##
## (Intercept) *
## prof_groupHR & Administration
## prof_groupManagement & Leadership
## prof_groupOther
## prof_groupSales, Marketing & Commercial
## prof_groupSpecialized Roles
## prof_groupTechnical & Engineering *
## routefriends
## routejobsite_apply
## routejobsite_resume
## routerecommend
## wagewhite
## score_selfctrl **
## score_anx *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 620.62 on 447 degrees of freedom
## Residual deviance: 586.09 on 434 degrees of freedom
## AIC: 614.09
##
## Number of Fisher Scoring iterations: 13
##
##
## $treatment
##
## Call:
## "Treatment Group Logistic Model"
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.10950 0.78923 1.406 0.1598
## prof_groupHR & Administration -1.05165 0.56079 -1.875 0.0608 .
## prof_groupManagement & Leadership -0.01037 0.64594 -0.016 0.9872
## prof_groupOther -0.79085 0.71793 -1.102 0.2706
## prof_groupSales, Marketing & Commercial -0.01305 0.62066 -0.021 0.9832
## prof_groupSpecialized Roles 14.35006 473.31335 0.030 0.9758
## prof_groupTechnical & Engineering -0.85172 0.62210 -1.369 0.1710
## routefriends -0.75990 0.43469 -1.748 0.0804 .
## routejobsite_apply 0.08910 0.37160 0.240 0.8105
## routejobsite_resume 0.52726 0.38035 1.386 0.1657
## routerecommend 0.62512 0.37808 1.653 0.0982 .
## wagewhite -0.34236 0.25427 -1.346 0.1782
## score_selfctrl 0.02366 0.04129 0.573 0.5666
## score_anx -0.05141 0.04780 -1.076 0.2821
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 943.00 on 680 degrees of freedom
## Residual deviance: 879.98 on 667 degrees of freedom
## AIC: 907.98
##
## Number of Fisher Scoring iterations: 14
##
##
## attr(,"class")
## [1] "summary.DualUplift"
Insights: - HR & Administration employees commuting by bus are less likely to quit, indicating stronger job satisfaction or better alignment with organizational values. - Employees with higher self-control and anxiety are less likely to quit among non-bus commuters. - For non-bus commuters, Technical & Engineering employees are less likely to quit. This group may have higher job satisfaction due to better alignment with role expectations. - Bus commuters, particularly those in HR & Administration, appear more stable. Consider expanding transportation-related support (e.g., subsidized bus passes or carpool options) to encourage bus use, as it may correlate with lower turnover.
Based on comprehensive analysis using various survival models (Kaplan-Meier, AFT, Cox Proportional Hazards, and Uplift models), several key factors influence employee retention, offering insights into potential strategies to reduce turnover.