Introduction

In today’s fast-paced work environments, retaining employees has become a top priority for organizations worldwide. When a team member leaves the company, this not only causes a disruption in workflows, but also leads to an increase in terms of costs that need to be spent on recruitment, training as well as necessary time to get along with other team members that decrease the overall productivity. As there’s a rising demand on identifying and tackling this challenge, many companies are increasingly turning to sophisticated analytical methods to uncover the underlying reasons behind employee turnover and allows for the development of predictive models to anticipate future departures. In this project, I will focus on using Survival Analysis for this problem with its advantages in modeling the employee’s risk of quitting from the companies based on a real dataset provided by Edward Babushkin, who examines the importance of this method instead of using logistic regression for risk assessment.

Data Information

Data Overview

The dataset contains the following variables:

stag: Employee experience (in years) will be renamed as experience (exp) in the analysis
event: Employee turnover (1 = turnover, 0 = no turnover)
gender: Employee’s gender
- f: Female
- m: Male
age: Employee’s age (in years)
industry: Employee’s industry sector
profession: Employee’s profession
traffic: The source or pipeline through which the employee came to the company:
- advert: Employee contacted the company directly after learning about it (through ads, company branding, etc.)
- recNErab: Contacted the company on the recommendation of a non-employee friend
- referal: Contacted the company on the recommendation of an employee friend
- youjs: Applied for a vacancy on a job site
- KA: Recruited through a recruitment agency
- friends: Invited by the employer, with prior acquaintance
- rabrecNErab: Employer contacted the employee on the recommendation of a non-employee acquaintance
- empjs: Employer found the employee through their resume on a job site
coach: Presence of a coach or training during probation (1 = yes, 0 = no)
head_gender: Gender of the employee’s supervisor (head)
greywage: Whether the salary is not reported to tax authorities (grey wage system)
way: Employee’s mode of transportation to work
extraversion: Employee’s extraversion score
independ: Employee’s independence score
selfcontrol: Employee’s self-control score
anxiety: Employee’s anxiety score
novator: Employee’s innovation (novator) score

# Read the data
emp_df <- read.csv("employee/turnover.csv")

# Show the data's structure
str(emp_df)

## 'data.frame':    1129 obs. of  16 variables:
##  $ stag        : num  7.03 22.97 15.93 15.93 8.41 ...
##  $ event       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ gender      : chr  "m" "m" "f" "f" ...
##  $ age         : num  35 33 35 35 32 42 42 28 29 30 ...
##  $ industry    : chr  "Banks" "Banks" "PowerGeneration" "PowerGeneration" ...
##  $ profession  : chr  "HR" "HR" "HR" "HR" ...
##  $ traffic     : chr  "rabrecNErab" "empjs" "rabrecNErab" "rabrecNErab" ...
##  $ coach       : chr  "no" "no" "no" "no" ...
##  $ head_gender : chr  "f" "m" "m" "m" ...
##  $ greywage    : chr  "white" "white" "white" "white" ...
##  $ way         : chr  "bus" "bus" "bus" "bus" ...
##  $ extraversion: num  6.2 6.2 6.2 5.4 3 6.2 6.2 3.8 8.6 5.4 ...
##  $ independ    : num  4.1 4.1 6.2 7.6 4.1 6.2 6.2 5.5 6.9 5.5 ...
##  $ selfcontrol : num  5.7 5.7 2.6 4.9 8 4.1 4.1 8 2.6 3.3 ...
##  $ anxiety     : num  7.1 7.1 4.8 2.5 7.1 5.6 5.6 4 4 7.9 ...
##  $ novator     : num  8.3 8.3 8.3 6.7 3.7 6.7 6.7 4.4 7.5 8.3 ...

# Sample the first 6 rows of the data
head(emp_df)

##        stag event gender age        industry profession     traffic coach
## 1  7.030801     1      m  35           Banks         HR rabrecNErab    no
## 2 22.965092     1      m  33           Banks         HR       empjs    no
## 3 15.934292     1      f  35 PowerGeneration         HR rabrecNErab    no
## 4 15.934292     1      f  35 PowerGeneration         HR rabrecNErab    no
## 5  8.410678     1      m  32          Retail Commercial       youjs   yes
## 6  8.969199     1      f  42     manufacture         HR       empjs   yes
##   head_gender greywage way extraversion independ selfcontrol anxiety novator
## 1           f    white bus          6.2      4.1         5.7     7.1     8.3
## 2           m    white bus          6.2      4.1         5.7     7.1     8.3
## 3           m    white bus          6.2      6.2         2.6     4.8     8.3
## 4           m    white bus          5.4      7.6         4.9     2.5     6.7
## 5           f    white bus          3.0      4.1         8.0     7.1     3.7
## 6           m    white bus          6.2      6.2         4.1     5.6     6.7

Data Pre-processing

# Rename the columns for better representation
colnames(emp_df) <- c("exp","event","self_gender","age","field","prof","route","coach","head_gender","wage","trans","score_extro","score_idpt","score_selfctrl","score_anx","score_innov")

# Categorize feature based on their data types
cat_features <- c("self_gender", "field", "prof", "route", "coach", "head_gender", "wage", "trans")
num_features <- colnames(emp_df)[!c(colnames(emp_df) %in% cat_features)]

num_features

## [1] "exp"            "event"          "age"            "score_extro"   
## [5] "score_idpt"     "score_selfctrl" "score_anx"      "score_innov"

First, we can investigate the unique values of each categorical variable.

# Unique values of each categorical features
for (col in cat_features){
  print(table(emp_df[col]))
}

## self_gender
##   f   m 
## 853 276 
## field
##          HoReCa     Agriculture           Banks        Building         Consult 
##              11              15             114              41              74 
##             etc              IT     manufacture          Mining          Pharma 
##              94             122             145              24              20 
## PowerGeneration      RealEstate          Retail           State         Telecom 
##              38              13             289              55              36 
##       transport 
##              38 
## prof
##          Finan\xf1e          Accounting BusinessDevelopment          Commercial 
##                  17                  10                  27                  23 
##             Consult            Engineer                 etc                  HR 
##                  25                  15                  37                 757 
##                  IT                 Law              manage           Marketing 
##                  74                   7                  22                  31 
##                  PR               Sales            Teaching 
##                   6                  66                  12 
## route
##      advert       empjs     friends          KA rabrecNErab    recNErab 
##          33         248         118          67         211          39 
##     referal       youjs 
##          95         318 
## coach
## my head      no     yes 
##     314     683     132 
## head_gender
##   f   m 
## 545 584 
## wage
##  grey white 
##   127  1002 
## trans
##  bus  car foot 
##  681  331  117

As we can see from a sneek peak of the dataset, during the data collection process, some typographical errors were introduced (e.g., Finance is recorded as Finan1e), which could potentially complicate the analysis and interpretation. To ensure accurate results and clear insights, it is necessary to address these inconsistencies. Additionally, we can refine and re-categorize certain values for easier interpretation based on the provided assumptions. Below are some suggested re-factorings for the categorical variables:

emp_df <- emp_df %>%
  mutate(prof = if_else(prof == "Finan\xf1e", "Finance", prof))

emp_df %>% mutate(field = factor(case_match(
  field,
  .default="others",
  "Retail" ~ "retail",
  "manufacture" ~ "manufacture",
  "IT" ~ "IT",
  "Banks" ~ "finance"))) -> emp_df

emp_df %>% mutate(route = factor(case_match(
  route,
  .default="others",
  "empjs" ~ "jobsite_resume",
  "friends" ~ "friends",
  c("referal", "rabrecNErab", "recNErab") ~ "recommend",
  "youjs" ~ "jobsite_apply"))) -> emp_df

for (col in cat_features){
  print(table(emp_df[col]))
}

## self_gender
##   f   m 
## 853 276 
## field
##     finance          IT manufacture      others      retail 
##         114         122         145         459         289 
## prof
##          Accounting BusinessDevelopment          Commercial             Consult 
##                  10                  27                  23                  25 
##            Engineer                 etc             Finance                  HR 
##                  15                  37                  17                 757 
##                  IT                 Law              manage           Marketing 
##                  74                   7                  22                  31 
##                  PR               Sales            Teaching 
##                   6                  66                  12 
## route
##        friends  jobsite_apply jobsite_resume         others      recommend 
##            118            318            248            100            345 
## coach
## my head      no     yes 
##     314     683     132 
## head_gender
##   f   m 
## 545 584 
## wage
##  grey white 
##   127  1002 
## trans
##  bus  car foot 
##  681  331  117

# Convert all categorical features to factor data type
for (col in cat_features) {
  emp_df[[col]] <- as.factor(emp_df[[col]])
}

## Change appropriate reference group:
emp_df$field <- relevel(emp_df$field, ref = "others")
emp_df$route <- relevel(emp_df$route, ref = "others")
emp_df$coach <- relevel(emp_df$coach, ref = "no")
emp_df$wage <- relevel(emp_df$wage, ref = "grey")
emp_df$trans <- relevel(emp_df$trans, ref = "bus")

Exploratory Data Analysis

In this section, we will explore the dataset to gain initial insights and identify patterns, trends, and potential anomalies. Through descriptive statistics and visualizations, we will highlight key features and prepare the dataset for more advanced analysis.

# Summary statistics for numerical features
summary(emp_df[num_features])

##       exp               event             age         score_extro    
##  Min.   :  0.3942   Min.   :0.0000   Min.   :18.00   Min.   : 1.000  
##  1st Qu.: 11.7289   1st Qu.:0.0000   1st Qu.:26.00   1st Qu.: 4.600  
##  Median : 24.3450   Median :1.0000   Median :30.00   Median : 5.400  
##  Mean   : 36.6275   Mean   :0.5058   Mean   :31.07   Mean   : 5.592  
##  3rd Qu.: 51.3183   3rd Qu.:1.0000   3rd Qu.:36.00   3rd Qu.: 7.000  
##  Max.   :179.4497   Max.   :1.0000   Max.   :58.00   Max.   :10.000  
##    score_idpt     score_selfctrl     score_anx       score_innov   
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.700   Min.   : 1.00  
##  1st Qu.: 4.100   1st Qu.: 4.100   1st Qu.: 4.800   1st Qu.: 4.40  
##  Median : 5.500   Median : 5.700   Median : 5.600   Median : 6.00  
##  Mean   : 5.478   Mean   : 5.597   Mean   : 5.666   Mean   : 5.88  
##  3rd Qu.: 6.900   3rd Qu.: 7.200   3rd Qu.: 7.100   3rd Qu.: 7.50  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00

# Set up the plot window to display multiple plots
par(mfrow = c(2, 2))

for (column in num_features) {
  hist(emp_df[[column]], main = paste("Histogram of", column), xlab = column, col = "lightblue")
}

We will further look into the distribution of response variable Experience (exp) for non-censored as well censored data:

par(mfrow=c(2,1))
hist(as.matrix(emp_df %>% filter(event==1) %>%  pull(exp)), xlab="Experience", main="Non-censored data",col = "lightblue")

hist(as.matrix(emp_df %>% filter(event==0) %>% pull(exp)), xlab="Experience", main="Censored data", col = "lightblue")

As we can observe the non-normal skewness from above histogram of the exp (years of experience) for both the uncensored and censored data, it would be reasonable to apply survival analysis for this problem

long_df <- pivot_longer(emp_df, cols =c(self_gender, field, prof, route, coach, head_gender, wage, trans), names_to = "Category", values_to = "Value")
cat_factors <- c("self_gender", "field", "prof", "route", "coach", "head_gender", "wage", "trans")

# Loop through each categorical variable and create a plot
plots <- lapply(cat_factors, function(cat) {
  ggplot(data = filter(long_df, Category == cat), aes(x = Value, fill = Value)) +
    geom_bar() +
    facet_wrap(~event, nrow = 2) + 
    labs(title = paste("Distribution of", cat, "by Event (censored or not)"), x = cat, y = "Count") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
})

plots

## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]

## 
## [[7]]

## 
## [[8]]

Non-parametric Methods (Kaplan-Meier):

First, we will investigate whether different groups within the dataset exhibit variations in retention rates (also referred to as survival rates in the context of survival analysis). To do this, we will calculate Kaplan-Meier estimates for each group, enabling a comparison of their retention rates. Additionally, we will conduct a log-rank test to determine whether the observed differences between the groups are statistically significant with predetermined significance level of 0.05.

mod <- survfit(Surv(exp, event)~self_gender, data = emp_df)
plot1 <- ggsurvplot(mod, data = emp_df,
                     conf.int = FALSE,
                     ggtheme = theme_minimal(),
                     pval = TRUE,
                     pval.method = TRUE) +
          ggtitle("Retention curve by Employee's Gender")

mod <- survfit(Surv(exp, event)~field, data = emp_df)
plot2 <- ggsurvplot(mod, data = emp_df,
                     conf.int = FALSE,
                     ggtheme = theme_minimal(),
                     pval = TRUE,
                     pval.method = TRUE) +
          ggtitle("Retention curve by Employee's Industry sector")


mod <- survfit(Surv(exp, event)~prof, data = emp_df)
plot3 <- ggsurvplot(mod, data = emp_df,
                     conf.int = FALSE,
                     ggtheme = theme_minimal(),
                     pval = TRUE,
                     pval.method = TRUE) +
          ggtitle("Retention curve by Employee's Profession")

mod <- survfit(Surv(exp, event)~route, data = emp_df)
plot4 <- ggsurvplot(mod, data = emp_df,
                     conf.int = FALSE,
                     ggtheme = theme_minimal(),
                     pval = TRUE,
                     pval.method = TRUE) +
          ggtitle("Retention curve by Employee's Source came to the Company")

mod <- survfit(Surv(exp, event)~coach, data = emp_df)
plot5 <- ggsurvplot(mod, data = emp_df,
                     conf.int = FALSE,
                     ggtheme = theme_minimal(),
                     pval = TRUE,
                     pval.method = TRUE) +
          ggtitle("Retention curve by Employee's Coaching/Training availability")

mod <- survfit(Surv(exp, event)~head_gender, data = emp_df)
plot6 <- ggsurvplot(mod, data = emp_df,
                     conf.int = FALSE,
                     ggtheme = theme_minimal(),
                     pval = TRUE,
                     pval.method = TRUE) +
          ggtitle("Retention curve by Gender of the employee's supervisor")

mod <- survfit(Surv(exp, event)~wage, data = emp_df)
plot7 <- ggsurvplot(mod, data = emp_df,
                     conf.int = FALSE,
                     ggtheme = theme_minimal(),
                     pval = TRUE,
                     pval.method = TRUE) +
          ggtitle("Retention curve by Employee's Wage (Tax reported or not)")


mod <- survfit(Surv(exp, event)~trans, data = emp_df)
plot8 <- ggsurvplot(mod, data = emp_df,
                     conf.int = FALSE,
                     ggtheme = theme_minimal(),
                     pval = TRUE,
                     pval.method = TRUE) +
          ggtitle("Retention curve by Employee's mode of Transportation")

plot1$plot; plot2$plot; plot3$plot; plot4$plot;

plot5$plot; plot6$plot; plot7$plot; plot8$plot

Based on the following results, we can confirm that there’s difference in terms of retention rate for the following variable: - Industry Sector (field) - Profession (prof) - Source came to Company (route) - Wage tax authorized or not (wage) - Mode of Transportation (trans)

Data Modelling

Parametric Survival Model

Find Parametric Distribution

We need to determine which distribution the Experience (exp) follows before applying parametric method such as Accelerated Failure Time (AFT) model.We will examine two popular assumed form of Experience (exp) Weibull, Exponential distribution.

library(goftest)
fitdistr(emp_df$exp, "weibull")

##       shape         scale   
##    1.08272153   37.78782884 
##  ( 0.02503311) ( 1.09608991)

weibullCDF <- function(x, shape, scale) {
  1 - exp(- (x / scale)^shape)
}
# Estimate parameters if not known
shape_hat <- 1.08272153  
scale_hat <- 37.78782884

# Perform the Anderson-Darling test for Weibull
result <- ad.test(emp_df$exp, function(x) weibullCDF(x, shape = shape_hat, scale = scale_hat))

# Print the test result
print(result)

## 
##  Anderson-Darling test of goodness-of-fit
##  Null hypothesis: distribution 'function(x) weibullCDF(x, shape =
##  shape_hat, scale = scale_hat)'
##  Parameters assumed to be fixed
## 
## data:  emp_df$exp
## An = 1.3391, p-value = 0.22

# Fit exponential model
exp_model <- survreg(Surv(exp, event) ~ 1, data = emp_df, dist = 'exponential')
lambda_hat <- 1 / exp_model$scale  # Rate parameter

# Anderson-Darling test for exponentiality
result <- ad.test(emp_df$exp, "pexp", rate = lambda_hat)

# Print the result
print(result)

## 
##  Anderson-Darling test of goodness-of-fit
##  Null hypothesis: exponential distribution
##  with parameter rate = 1
##  Parameters assumed to be fixed
## 
## data:  emp_df$exp
## An = Inf, p-value = 5.314e-07

After fitting the exponential model with a scale parameter of 1, we obtain a p-value of less than 0.05, leading us to reject the null hypothesis that the exponential distribution with a rate parameter of 1 is appropriate for this model. Next, we test the Weibull distribution, using the estimated scale and shape parameters derived from the exp variable in the dataset. The p-value from the Anderson-Darling Goodness-of-Fit test is 0.22, indicating that we fail to reject the null hypothesis. Thus, we proceed with the Weibull distribution as the assumption for the AFT model as below.

We will define the base model that included all predictors, then perform step-wise regression for finding the best combinations of predictors (based on the criteria of lowest AIC)

base_mod<-survreg(Surv(exp, event)~.,data = emp_df, dist = "weibull")
step(base_mod)

## Start:  AIC=5956.48
## Surv(exp, event) ~ self_gender + age + field + prof + route + 
##     coach + head_gender + wage + trans + score_extro + score_idpt + 
##     score_selfctrl + score_anx + score_innov
## 
##                  Df    AIC
## - score_innov     1 5954.5
## - self_gender     1 5954.7
## - score_idpt      1 5954.7
## - score_extro     1 5955.0
## - head_gender     1 5955.4
## - coach           2 5955.9
## <none>              5956.5
## - score_selfctrl  1 5957.0
## - prof           14 5957.7
## - score_anx       1 5957.8
## - trans           2 5961.3
## - age             1 5965.0
## - wage            1 5966.6
## - field           4 5979.1
## - route           4 5981.4
## 
## Step:  AIC=5954.5
## Surv(exp, event) ~ self_gender + age + field + prof + route + 
##     coach + head_gender + wage + trans + score_extro + score_idpt + 
##     score_selfctrl + score_anx
## 
##                  Df    AIC
## - self_gender     1 5952.7
## - score_idpt      1 5952.8
## - score_extro     1 5953.0
## - head_gender     1 5953.4
## - coach           2 5953.9
## <none>              5954.5
## - score_selfctrl  1 5955.4
## - prof           14 5955.7
## - score_anx       1 5956.0
## - trans           2 5959.3
## - age             1 5963.0
## - wage            1 5964.6
## - field           4 5977.3
## - route           4 5979.5
## 
## Step:  AIC=5952.68
## Surv(exp, event) ~ age + field + prof + route + coach + head_gender + 
##     wage + trans + score_extro + score_idpt + score_selfctrl + 
##     score_anx
## 
##                  Df    AIC
## - score_idpt      1 5951.0
## - score_extro     1 5951.1
## - head_gender     1 5951.5
## - coach           2 5952.0
## <none>              5952.7
## - score_selfctrl  1 5953.8
## - prof           14 5954.2
## - score_anx       1 5954.8
## - trans           2 5957.6
## - age             1 5961.2
## - wage            1 5963.0
## - field           4 5976.1
## - route           4 5977.6
## 
## Step:  AIC=5950.97
## Surv(exp, event) ~ age + field + prof + route + coach + head_gender + 
##     wage + trans + score_extro + score_selfctrl + score_anx
## 
##                  Df    AIC
## - head_gender     1 5949.8
## - score_extro     1 5950.1
## - coach           2 5950.4
## <none>              5951.0
## - score_selfctrl  1 5952.1
## - prof           14 5952.2
## - score_anx       1 5953.4
## - trans           2 5955.9
## - age             1 5959.3
## - wage            1 5961.5
## - field           4 5974.3
## - route           4 5975.9
## 
## Step:  AIC=5949.83
## Surv(exp, event) ~ age + field + prof + route + coach + wage + 
##     trans + score_extro + score_selfctrl + score_anx
## 
##                  Df    AIC
## - score_extro     1 5949.0
## - coach           2 5949.4
## <none>              5949.8
## - score_selfctrl  1 5951.0
## - prof           14 5951.9
## - score_anx       1 5952.7
## - trans           2 5955.0
## - age             1 5959.8
## - wage            1 5960.4
## - field           4 5972.7
## - route           4 5974.0
## 
## Step:  AIC=5949.01
## Surv(exp, event) ~ age + field + prof + route + coach + wage + 
##     trans + score_selfctrl + score_anx
## 
##                  Df    AIC
## - coach           2 5948.8
## <none>              5949.0
## - prof           14 5951.1
## - score_anx       1 5953.3
## - trans           2 5954.4
## - score_selfctrl  1 5955.3
## - age             1 5958.1
## - wage            1 5959.4
## - field           4 5972.0
## - route           4 5974.1
## 
## Step:  AIC=5948.76
## Surv(exp, event) ~ age + field + prof + route + wage + trans + 
##     score_selfctrl + score_anx
## 
##                  Df    AIC
## <none>              5948.8
## - prof           14 5951.4
## - score_anx       1 5952.8
## - trans           2 5953.6
## - score_selfctrl  1 5955.6
## - age             1 5959.3
## - wage            1 5959.5
## - field           4 5971.9
## - route           4 5972.6

## Call:
## survreg(formula = Surv(exp, event) ~ age + field + prof + route + 
##     wage + trans + score_selfctrl + score_anx, data = emp_df, 
##     dist = "weibull")
## 
## Coefficients:
##             (Intercept)                     age            fieldfinance 
##             4.290954476            -0.019667910            -0.331611466 
##                 fieldIT        fieldmanufacture             fieldretail 
##             0.460216178             0.133770055             0.294066748 
## profBusinessDevelopment          profCommercial             profConsult 
##            -0.331475026            -0.816549442            -0.746764123 
##            profEngineer                 profetc             profFinance 
##            -0.771198398            -0.477817033            -0.259299256 
##                  profHR                  profIT                 profLaw 
##            -0.256849028            -0.025093874            -0.447873317 
##              profmanage           profMarketing                  profPR 
##            -1.096702978            -0.734964234            -0.755396244 
##               profSales            profTeaching            routefriends 
##            -0.503692717            -0.460266310             0.005788691 
##      routejobsite_apply     routejobsite_resume          routerecommend 
##            -0.403216519            -0.652138777            -0.191406994 
##               wagewhite                transcar               transfoot 
##             0.424160983             0.164941113             0.362512604 
##          score_selfctrl               score_anx 
##             0.056083991             0.053587423 
## 
## Scale= 0.8594605 
## 
## Loglik(model)= -2944.4   Loglik(intercept only)= -3014.3
##  Chisq= 139.79 on 28 degrees of freedom, p= <2e-16 
## n= 1129

best_mod <- survreg(formula = Surv(exp, event) ~ age + field + prof + route + 
    wage + trans + score_selfctrl + score_anx, data = emp_df, 
    dist = "weibull")
summary(best_mod)

## 
## Call:
## survreg(formula = Surv(exp, event) ~ age + field + prof + route + 
##     wage + trans + score_selfctrl + score_anx, data = emp_df, 
##     dist = "weibull")
##                            Value Std. Error     z       p
## (Intercept)              4.29095    0.46798  9.17 < 2e-16
## age                     -0.01967    0.00544 -3.61 0.00030
## fieldfinance            -0.33161    0.11902 -2.79 0.00533
## fieldIT                  0.46022    0.16580  2.78 0.00551
## fieldmanufacture         0.13377    0.12396  1.08 0.28054
## fieldretail              0.29407    0.09774  3.01 0.00262
## profBusinessDevelopment -0.33148    0.42033 -0.79 0.43034
## profCommercial          -0.81655    0.42315 -1.93 0.05365
## profConsult             -0.74676    0.42241 -1.77 0.07708
## profEngineer            -0.77120    0.44142 -1.75 0.08062
## profetc                 -0.47782    0.40574 -1.18 0.23894
## profFinance             -0.25930    0.43531 -0.60 0.55140
## profHR                  -0.25685    0.35998 -0.71 0.47553
## profIT                  -0.02509    0.40052 -0.06 0.95004
## profLaw                 -0.44787    0.52706 -0.85 0.39546
## profmanage              -1.09670    0.41998 -2.61 0.00902
## profMarketing           -0.73496    0.40380 -1.82 0.06874
## profPR                  -0.75540    0.53079 -1.42 0.15470
## profSales               -0.50369    0.38679 -1.30 0.19284
## profTeaching            -0.46027    0.44598 -1.03 0.30206
## routefriends             0.00579    0.18479  0.03 0.97501
## routejobsite_apply      -0.40322    0.14682 -2.75 0.00603
## routejobsite_resume     -0.65214    0.14948 -4.36 1.3e-05
## routerecommend          -0.19141    0.13977 -1.37 0.17086
## wagewhite                0.42416    0.11255  3.77 0.00016
## transcar                 0.16494    0.08429  1.96 0.05037
## transfoot                0.36251    0.14416  2.51 0.01192
## score_selfctrl           0.05608    0.01881  2.98 0.00287
## score_anx                0.05359    0.02189  2.45 0.01436
## Log(scale)              -0.15145    0.03298 -4.59 4.4e-06
## 
## Scale= 0.859 
## 
## Weibull distribution
## Loglik(model)= -2944.4   Loglik(intercept only)= -3014.3
##  Chisq= 139.79 on 28 degrees of freedom, p= 8.2e-17 
## Number of Newton-Raphson Iterations: 7 
## n= 1129

We can see some interesting insights based on the significant variables, the following would show all hazard ratios for each estimate obtained from the model and some interpretations of each significant factor based on the results, controlling other factors remain the same.

haz_rate <- exp(best_mod$coefficients)
haz_rate

##             (Intercept)                     age            fieldfinance 
##              73.0361465               0.9805242               0.7177661 
##                 fieldIT        fieldmanufacture             fieldretail 
##               1.5844165               1.1431299               1.3418735 
## profBusinessDevelopment          profCommercial             profConsult 
##               0.7178641               0.4419540               0.4738975 
##            profEngineer                 profetc             profFinance 
##               0.4624585               0.6201357               0.7715921 
##                  profHR                  profIT                 profLaw 
##               0.7734850               0.9752184               0.6389856 
##              profmanage           profMarketing                  profPR 
##               0.3339704               0.4795226               0.4698244 
##               profSales            profTeaching            routefriends 
##               0.6042950               0.6311156               1.0058055 
##      routejobsite_apply     routejobsite_resume          routerecommend 
##               0.6681674               0.5209304               0.8257964 
##               wagewhite                transcar               transfoot 
##               1.5283076               1.1793237               1.4369353 
##          score_selfctrl               score_anx 
##               1.0576865               1.0550492

Individuals working in the IT field tend to stay at their companies 2.2 times longer than those in finance. Similarly, individuals who apply for jobs through job sites or resumes generally have shorter tenures compared to those who secure positions through referrals from friends. Additionally, employees receiving fully taxed wages (white wages) tend to remain at their companies 52% longer than those receiving untaxed (gray) wages, indicating a longer duration of employment. Higher self-control scores are also associated with longer tenure. Interestingly, individuals who walk to work tend to stay 43.7% longer at their companies compared to those who primarily commute by bus.

Cox Proportional Hazards model (semi-parametric)

In order to derive meaningful conclusions about employee retention with a robust handling of censored data that can serve as a predictive model, we can perform Cox proportional Hazard model (semi-parametric model).

cox_base_mod <- coxph(Surv(exp, event)~.,data = emp_df)
step(cox_base_mod)

## Start:  AIC=6871.14
## Surv(exp, event) ~ self_gender + age + field + prof + route + 
##     coach + head_gender + wage + trans + score_extro + score_idpt + 
##     score_selfctrl + score_anx + score_innov
## 
##                  Df    AIC
## - score_innov     1 6869.1
## - score_idpt      1 6869.3
## - self_gender     1 6869.5
## - score_extro     1 6869.8
## - head_gender     1 6869.9
## - coach           2 6870.3
## - prof           14 6870.7
## - score_selfctrl  1 6871.1
## <none>              6871.1
## - score_anx       1 6872.0
## - trans           2 6876.6
## - age             1 6880.0
## - wage            1 6881.1
## - field           4 6892.5
## - route           4 6892.7
## 
## Step:  AIC=6869.15
## Surv(exp, event) ~ self_gender + age + field + prof + route + 
##     coach + head_gender + wage + trans + score_extro + score_idpt + 
##     score_selfctrl + score_anx
## 
##                  Df    AIC
## - score_idpt      1 6867.3
## - self_gender     1 6867.5
## - score_extro     1 6867.8
## - head_gender     1 6867.9
## - coach           2 6868.3
## - prof           14 6868.8
## <none>              6869.1
## - score_selfctrl  1 6869.4
## - score_anx       1 6870.1
## - trans           2 6874.6
## - age             1 6878.0
## - wage            1 6879.2
## - field           4 6890.6
## - route           4 6890.9
## 
## Step:  AIC=6867.26
## Surv(exp, event) ~ self_gender + age + field + prof + route + 
##     coach + head_gender + wage + trans + score_extro + score_selfctrl + 
##     score_anx
## 
##                  Df    AIC
## - self_gender     1 6865.6
## - head_gender     1 6866.1
## - coach           2 6866.5
## - score_extro     1 6866.5
## - prof           14 6866.8
## <none>              6867.3
## - score_selfctrl  1 6867.7
## - score_anx       1 6868.8
## - trans           2 6872.8
## - age             1 6876.0
## - wage            1 6877.4
## - field           4 6888.7
## - route           4 6889.0
## 
## Step:  AIC=6865.63
## Surv(exp, event) ~ age + field + prof + route + coach + head_gender + 
##     wage + trans + score_extro + score_selfctrl + score_anx
## 
##                  Df    AIC
## - head_gender     1 6864.4
## - coach           2 6864.8
## - score_extro     1 6865.0
## - prof           14 6865.0
## <none>              6865.6
## - score_selfctrl  1 6866.3
## - score_anx       1 6867.9
## - trans           2 6871.2
## - age             1 6874.4
## - wage            1 6876.1
## - route           4 6887.2
## - field           4 6887.7
## 
## Step:  AIC=6864.38
## Surv(exp, event) ~ age + field + prof + route + coach + wage + 
##     trans + score_extro + score_selfctrl + score_anx
## 
##                  Df    AIC
## - coach           2 6863.6
## - score_extro     1 6863.7
## <none>              6864.4
## - prof           14 6864.5
## - score_selfctrl  1 6865.2
## - score_anx       1 6867.1
## - trans           2 6870.1
## - age             1 6874.7
## - wage            1 6874.9
## - route           4 6885.3
## - field           4 6886.2
## 
## Step:  AIC=6863.64
## Surv(exp, event) ~ age + field + prof + route + wage + trans + 
##     score_extro + score_selfctrl + score_anx
## 
##                  Df    AIC
## - score_extro     1 6863.2
## <none>              6863.6
## - prof           14 6864.3
## - score_selfctrl  1 6864.6
## - score_anx       1 6866.1
## - trans           2 6869.0
## - wage            1 6874.5
## - age             1 6875.4
## - route           4 6883.6
## - field           4 6885.6
## 
## Step:  AIC=6863.18
## Surv(exp, event) ~ age + field + prof + route + wage + trans + 
##     score_selfctrl + score_anx
## 
##                  Df    AIC
## <none>              6863.2
## - prof           14 6863.8
## - score_anx       1 6867.1
## - trans           2 6868.6
## - score_selfctrl  1 6869.6
## - wage            1 6873.6
## - age             1 6873.8
## - route           4 6884.1
## - field           4 6885.1

## Call:
## coxph(formula = Surv(exp, event) ~ age + field + prof + route + 
##     wage + trans + score_selfctrl + score_anx, data = emp_df)
## 
##                               coef  exp(coef)   se(coef)      z        p
## age                      0.0229722  1.0232381  0.0063668  3.608 0.000308
## fieldfinance             0.3801051  1.4624382  0.1384227  2.746 0.006033
## fieldIT                 -0.5344071  0.5860166  0.1927055 -2.773 0.005551
## fieldmanufacture        -0.1430499  0.8667108  0.1441057 -0.993 0.320869
## fieldretail             -0.3322475  0.7173097  0.1145145 -2.901 0.003716
## profBusinessDevelopment  0.3706328  1.4486511  0.4894983  0.757 0.448949
## profCommercial           0.9380492  2.5549923  0.4931655  1.902 0.057158
## profConsult              0.8220196  2.2750899  0.4929325  1.668 0.095393
## profEngineer             0.8683370  2.3829446  0.5138028  1.690 0.091024
## profetc                  0.5089503  1.6635441  0.4737691  1.074 0.282707
## profFinance              0.2101460  1.2338582  0.5103214  0.412 0.680492
## profHR                   0.2733241  1.3143262  0.4198372  0.651 0.515031
## profIT                   0.0326334  1.0331718  0.4667599  0.070 0.944261
## profLaw                  0.4650626  1.5921138  0.6153673  0.756 0.449800
## profmanage               1.2247859  3.4034373  0.4910183  2.494 0.012618
## profMarketing            0.7951742  2.2148268  0.4713507  1.687 0.091601
## profPR                   0.8544298  2.3500340  0.6193306  1.380 0.167709
## profSales                0.5113095  1.6674734  0.4512538  1.133 0.257178
## profTeaching             0.5447424  1.7241642  0.5186051  1.050 0.293535
## routefriends            -0.0004327  0.9995674  0.2156620 -0.002 0.998399
## routejobsite_apply       0.4623123  1.5877411  0.1731386  2.670 0.007581
## routejobsite_resume      0.7291377  2.0732920  0.1764617  4.132  3.6e-05
## routerecommend           0.2287965  1.2570862  0.1632106  1.402 0.160961
## wagewhite               -0.4886619  0.6134467  0.1315418 -3.715 0.000203
## transcar                -0.2026201  0.8165884  0.0990296 -2.046 0.040751
## transfoot               -0.4316186  0.6494570  0.1675262 -2.576 0.009983
## score_selfctrl          -0.0635267  0.9384491  0.0219958 -2.888 0.003875
## score_anx               -0.0618985  0.9399783  0.0255270 -2.425 0.015316
## 
## Likelihood ratio test=133.4  on 28 df, p=1.098e-15
## n= 1129, number of events= 571

best_cox_mod <- coxph(formula = Surv(exp, event) ~ age + field + prof + route + 
    wage + trans + score_selfctrl + score_anx, data = emp_df)

Since we assumes that the hazard ratios are proportional over time, not assuming a specific form for the baseline hazard function, we will check for proportional hazard assumptions of this model as below:

# PH-test
ph_test <- cox.zph(best_cox_mod)
print(ph_test)

##                 chisq df       p
## age             0.845  1 0.35798
## field           2.546  4 0.63648
## prof           40.331 14 0.00023
## route           9.072  4 0.05932
## wage            0.778  1 0.37768
## trans           2.270  2 0.32144
## score_selfctrl  0.494  1 0.48200
## score_anx       1.192  1 0.27489
## GLOBAL         55.074 28 0.00167

plot(ph_test)

# Stratify the model by the covariate that violates the assumption
cox_model_strat <- coxph(Surv(exp, event) ~ age + field + strata(prof) + route + 
    wage + trans + score_selfctrl + score_anx, data = emp_df)
summary(cox_model_strat)

## Call:
## coxph(formula = Surv(exp, event) ~ age + field + strata(prof) + 
##     route + wage + trans + score_selfctrl + score_anx, data = emp_df)
## 
##   n= 1129, number of events= 571 
## 
##                          coef exp(coef)  se(coef)      z Pr(>|z|)    
## age                  0.023977  1.024266  0.006529  3.672 0.000240 ***
## fieldfinance         0.351325  1.420950  0.139990  2.510 0.012085 *  
## fieldIT             -0.517867  0.595790  0.195140 -2.654 0.007959 ** 
## fieldmanufacture    -0.200800  0.818076  0.147234 -1.364 0.172626    
## fieldretail         -0.314352  0.730262  0.115144 -2.730 0.006332 ** 
## routefriends        -0.037390  0.963301  0.225313 -0.166 0.868199    
## routejobsite_apply   0.433081  1.542001  0.177583  2.439 0.014738 *  
## routejobsite_resume  0.682893  1.979596  0.180117  3.791 0.000150 ***
## routerecommend       0.249538  1.283432  0.168212  1.483 0.137948    
## wagewhite           -0.469301  0.625440  0.133487 -3.516 0.000439 ***
## transcar            -0.171671  0.842257  0.100130 -1.714 0.086443 .  
## transfoot           -0.421441  0.656100  0.176033 -2.394 0.016661 *  
## score_selfctrl      -0.064877  0.937183  0.022436 -2.892 0.003833 ** 
## score_anx           -0.060366  0.941420  0.025798 -2.340 0.019285 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                     exp(coef) exp(-coef) lower .95 upper .95
## age                    1.0243     0.9763    1.0112    1.0375
## fieldfinance           1.4209     0.7038    1.0800    1.8696
## fieldIT                0.5958     1.6784    0.4064    0.8734
## fieldmanufacture       0.8181     1.2224    0.6130    1.0917
## fieldretail            0.7303     1.3694    0.5827    0.9151
## routefriends           0.9633     1.0381    0.6194    1.4981
## routejobsite_apply     1.5420     0.6485    1.0887    2.1840
## routejobsite_resume    1.9796     0.5052    1.3908    2.8177
## routerecommend         1.2834     0.7792    0.9230    1.7847
## wagewhite              0.6254     1.5989    0.4815    0.8125
## transcar               0.8423     1.1873    0.6922    1.0249
## transfoot              0.6561     1.5242    0.4647    0.9264
## score_selfctrl         0.9372     1.0670    0.8969    0.9793
## score_anx              0.9414     1.0622    0.8950    0.9902
## 
## Concordance= 0.635  (se = 0.016 )
## Likelihood ratio test= 96.17  on 14 df,   p=3e-14
## Wald test            = 99.54  on 14 df,   p=6e-15
## Score (logrank) test = 101.5  on 14 df,   p=2e-15

After applying stratify for the prof variables, we have the proportional hazard assumption holds for the model.

# Test the proportional hazards assumption
ph_test <- cox.zph(cox_model_strat)

# Print the test results
print(ph_test)

##                   chisq df     p
## age            2.48e-01  1 0.619
## field          1.04e+00  4 0.903
## route          8.47e+00  4 0.076
## wage           2.33e-01  1 0.629
## trans          1.06e+00  2 0.589
## score_selfctrl 4.82e-05  1 0.994
## score_anx      6.95e-01  1 0.404
## GLOBAL         1.30e+01 14 0.524

# Plot the Schoenfeld residuals for each covariate
plot(ph_test)

We have the following insights for the aboeve Cox proportional hazard model: + Employees hired through jobsite applications have a 54.2% higher risk of quitting compared to those from other routes. This may suggest that employees who apply online might be less committed or have more options + Employees hired by submitting resumes to job sites face almost double the risk of quitting compared to other sources, likely due to higher mobility or availability of alternative job offers + Employees receiving white (tax-authorized) wages have a 37.5% lower risk of quitting compared to those paid under a grey wage system. This suggests that legal, tax-compliant wage structures contribute to better retention, likely due to higher job security and benefits. + Employees who commute on foot have a 34.4% lower risk of quitting compared to bus commuters. This could suggest that employees living closer to work may have greater job stability or satisfaction. + For each unit increase in self-control, the risk of quitting decreases by 6.3%. Employees with higher self-control are likely to stay longer, which could be due to their ability to manage work stress and maintain focus on long-term goals. + For each unit increase in anxiety, the risk of quitting decreases by 5.9%. This counterintuitive result suggests that more anxious employees might stay longer, possibly due to a fear of change or job loss.

# Define function for customizing the output for Cox proportional hazard model's results
output_custom <- function(model, exp = FALSE){
  # Extract coefficients
  coeffs <- summary(model)$coefficients
    
  
  output_df <- data.frame(
    estimate = coeffs[,1],
    std_err =  coeffs[,3],
    p_val = coeffs[,5],
    significant = coeffs[,5] <= 0.05
  )
  
  output_df[,1:3] <- round(output_df[,1:3], 3)
  
  if (exp == TRUE){
    output_df$exp_est <- round(coeffs[,2], 3)
    output_df <- output_df %>% relocate(exp_est, .after = estimate)
  }
  
  return (output_df)
}

output_custom(cox_model_strat, exp = TRUE)

##                     estimate exp_est std_err p_val significant
## age                    0.024   1.024   0.007 0.000        TRUE
## fieldfinance           0.351   1.421   0.140 0.012        TRUE
## fieldIT               -0.518   0.596   0.195 0.008        TRUE
## fieldmanufacture      -0.201   0.818   0.147 0.173       FALSE
## fieldretail           -0.314   0.730   0.115 0.006        TRUE
## routefriends          -0.037   0.963   0.225 0.868       FALSE
## routejobsite_apply     0.433   1.542   0.178 0.015        TRUE
## routejobsite_resume    0.683   1.980   0.180 0.000        TRUE
## routerecommend         0.250   1.283   0.168 0.138       FALSE
## wagewhite             -0.469   0.625   0.133 0.000        TRUE
## transcar              -0.172   0.842   0.100 0.086       FALSE
## transfoot             -0.421   0.656   0.176 0.017        TRUE
## score_selfctrl        -0.065   0.937   0.022 0.004        TRUE
## score_anx             -0.060   0.941   0.026 0.019        TRUE

# Calculate the concordance index
c_index_cox <- concordance(cox_model_strat)
print(c_index_cox$concordance)

## [1] 0.6352516

cox_model_strat_final <- coxph(Surv(exp, event) ~ age + field + strata(prof) + route + 
    coach + head_gender + wage + trans + score_selfctrl + score_anx +score_anx, data = emp_df)
summary(cox_model_strat_final)

## Call:
## coxph(formula = Surv(exp, event) ~ age + field + strata(prof) + 
##     route + coach + head_gender + wage + trans + score_selfctrl + 
##     score_anx + score_anx, data = emp_df)
## 
##   n= 1129, number of events= 571 
## 
##                          coef exp(coef)  se(coef)      z Pr(>|z|)    
## age                  0.022247  1.022496  0.006919  3.215 0.001304 ** 
## fieldfinance         0.335279  1.398331  0.141756  2.365 0.018021 *  
## fieldIT             -0.541747  0.581731  0.195788 -2.767 0.005657 ** 
## fieldmanufacture    -0.200710  0.818150  0.149028 -1.347 0.178048    
## fieldretail         -0.323286  0.723767  0.115266 -2.805 0.005036 ** 
## routefriends        -0.034195  0.966383  0.226005 -0.151 0.879738    
## routejobsite_apply   0.442925  1.557255  0.178815  2.477 0.013249 *  
## routejobsite_resume  0.713682  2.041494  0.181911  3.923 8.74e-05 ***
## routerecommend       0.248957  1.282687  0.168657  1.476 0.139912    
## coachmy head        -0.081567  0.921671  0.110288 -0.740 0.459554    
## coachyes             0.168370  1.183375  0.139315  1.209 0.226833    
## head_genderm         0.129126  1.137834  0.098767  1.307 0.191084    
## wagewhite           -0.466276  0.627334  0.134182 -3.475 0.000511 ***
## transcar            -0.174091  0.840220  0.100448 -1.733 0.083070 .  
## transfoot           -0.428388  0.651558  0.176444 -2.428 0.015186 *  
## score_selfctrl      -0.061366  0.940479  0.022521 -2.725 0.006434 ** 
## score_anx           -0.057210  0.944396  0.026122 -2.190 0.028518 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                     exp(coef) exp(-coef) lower .95 upper .95
## age                    1.0225     0.9780    1.0087    1.0365
## fieldfinance           1.3983     0.7151    1.0591    1.8462
## fieldIT                0.5817     1.7190    0.3963    0.8538
## fieldmanufacture       0.8181     1.2223    0.6109    1.0957
## fieldretail            0.7238     1.3817    0.5774    0.9072
## routefriends           0.9664     1.0348    0.6205    1.5050
## routejobsite_apply     1.5573     0.6422    1.0969    2.2109
## routejobsite_resume    2.0415     0.4898    1.4292    2.9160
## routerecommend         1.2827     0.7796    0.9216    1.7852
## coachmy head           0.9217     1.0850    0.7425    1.1441
## coachyes               1.1834     0.8450    0.9006    1.5549
## head_genderm           1.1378     0.8789    0.9376    1.3809
## wagewhite              0.6273     1.5940    0.4823    0.8160
## transcar               0.8402     1.1902    0.6901    1.0230
## transfoot              0.6516     1.5348    0.4611    0.9208
## score_selfctrl         0.9405     1.0633    0.8999    0.9829
## score_anx              0.9444     1.0589    0.8973    0.9940
## 
## Concordance= 0.641  (se = 0.015 )
## Likelihood ratio test= 100.7  on 17 df,   p=7e-14
## Wald test            = 104.3  on 17 df,   p=1e-14
## Score (logrank) test = 106.2  on 17 df,   p=6e-15

With a C-index of 0.635 indicates the Cox proportional hazard model is better than random but has moderate predictive ability. Therefore, we can consider the non-linear model such as Random Forest Survival models and Gradient Boosting Survival models.

Uplift Survival Analysis

library(tools4uplift)

## Warning: package 'tools4uplift' was built under R version 4.3.3

Based on the above results, since we confirm that there’s difference in terms of retention rate with the log-rank test for the following variable: - Industry Sector (field) - Profession (prof) - Source came to Company (route) - Wage tax authorized or not (wage) - Mode of Transportation (trans)

We will conduct quasi-experimental analysis for each of those factors (treat each of those variables as potential treatment-like interventions that might affect employee retention)

Industry Sector (Field)

We have a hypothesis that different industries may respond differently to retention strategies like training or employee engagement program with better communication.

We will test if employees in retail or finance sectors have better retention when given more flexible schedules or additional professional development.

temp <- data.frame(emp_df)
temp$treatment <- ifelse(temp$field %in% c('finance', 'IT'), 1, 0)
head(temp)

##         exp event self_gender age       field       prof          route coach
## 1  7.030801     1           m  35     finance         HR      recommend    no
## 2 22.965092     1           m  33     finance         HR jobsite_resume    no
## 3 15.934292     1           f  35      others         HR      recommend    no
## 4 15.934292     1           f  35      others         HR      recommend    no
## 5  8.410678     1           m  32      retail Commercial  jobsite_apply   yes
## 6  8.969199     1           f  42 manufacture         HR jobsite_resume   yes
##   head_gender  wage trans score_extro score_idpt score_selfctrl score_anx
## 1           f white   bus         6.2        4.1            5.7       7.1
## 2           m white   bus         6.2        4.1            5.7       7.1
## 3           m white   bus         6.2        6.2            2.6       4.8
## 4           m white   bus         5.4        7.6            4.9       2.5
## 5           f white   bus         3.0        4.1            8.0       7.1
## 6           m white   bus         6.2        6.2            4.1       5.6
##   score_innov treatment
## 1         8.3         1
## 2         8.3         1
## 3         8.3         0
## 4         6.7         0
## 5         3.7         0
## 6         6.7         0

table(emp_df$route)

## 
##         others        friends  jobsite_apply jobsite_resume      recommend 
##            100            118            318            248            345

# Recreate emp_df_copy to avoid modifying the original data
emp_df_copy <- temp[, c("treatment", "event", "route", "wage", "trans", "score_selfctrl", "score_anx")]

# Ensure factor levels in the dataset are consistent
emp_df_copy$route <- factor(emp_df_copy$route)
emp_df_copy$trans <- factor(emp_df_copy$trans)

# Fit the two-model uplift regression
uplift_model <- DualUplift(emp_df_copy, treat = "treatment", outcome = "event", 
                           predictors = c("route", "wage", "trans", "score_selfctrl", "score_anx"))

print(uplift_model)

## $call
## DualUplift(data = emp_df_copy, treat = "treatment", outcome = "event", 
##     predictors = c("route", "wage", "trans", "score_selfctrl", 
##         "score_anx"))
## 
## $control
## 
## Call:  "Control Group Logistic Model"
## 
## Coefficients:
##         (Intercept)         routefriends   routejobsite_apply  
##            1.232908            -0.529325            -0.289496  
## routejobsite_resume       routerecommend            wagewhite  
##           -0.009535             0.302337            -0.239487  
##            transcar            transfoot       score_selfctrl  
##           -0.055595            -0.558426            -0.070139  
##           score_anx  
##           -0.075524  
## 
## Degrees of Freedom: 892 Total (i.e. Null);  883 Residual
## Null Deviance:       1237 
## Residual Deviance: 1205  AIC: 1225
## 
## $treatment
## 
## Call:  "Treatment Group Logistic Model"
## 
## Coefficients:
##         (Intercept)         routefriends   routejobsite_apply  
##             0.01622             -1.46029             -0.13893  
## routejobsite_resume       routerecommend            wagewhite  
##             0.41459              0.83327             -0.12025  
##            transcar            transfoot       score_selfctrl  
##            -0.04992             -0.79294              0.07835  
##           score_anx  
##            -0.10638  
## 
## Degrees of Freedom: 235 Total (i.e. Null);  226 Residual
## Null Deviance:       325.8 
## Residual Deviance: 300.9     AIC: 320.9
## 
## attr(,"class")
## [1] "print.DualUplift"

summary(uplift_model)

## $call
## DualUplift(data = emp_df_copy, treat = "treatment", outcome = "event", 
##     predictors = c("route", "wage", "trans", "score_selfctrl", 
##         "score_anx"))
## 
## $control
## 
## Call:
## "Control Group Logistic Model"
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)   
## (Intercept)          1.232908   0.437729   2.817  0.00485 **
## routefriends        -0.529325   0.310754  -1.703  0.08850 . 
## routejobsite_apply  -0.289496   0.263785  -1.097  0.27244   
## routejobsite_resume -0.009535   0.272435  -0.035  0.97208   
## routerecommend       0.302337   0.260759   1.159  0.24627   
## wagewhite           -0.239487   0.211319  -1.133  0.25709   
## transcar            -0.055595   0.153640  -0.362  0.71746   
## transfoot           -0.558426   0.240002  -2.327  0.01998 * 
## score_selfctrl      -0.070139   0.035701  -1.965  0.04946 * 
## score_anx           -0.075524   0.041371  -1.826  0.06792 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1236.9  on 892  degrees of freedom
## Residual deviance: 1205.0  on 883  degrees of freedom
## AIC: 1225
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## $treatment
## 
## Call:
## "Treatment Group Logistic Model"
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)  
## (Intercept)          0.01622    0.95179   0.017   0.9864  
## routefriends        -1.46029    0.83021  -1.759   0.0786 .
## routejobsite_apply  -0.13893    0.60133  -0.231   0.8173  
## routejobsite_resume  0.41459    0.61215   0.677   0.4982  
## routerecommend       0.83327    0.57672   1.445   0.1485  
## wagewhite           -0.12025    0.52693  -0.228   0.8195  
## transcar            -0.04992    0.34758  -0.144   0.8858  
## transfoot           -0.79294    0.51682  -1.534   0.1250  
## score_selfctrl       0.07835    0.07087   1.105   0.2690  
## score_anx           -0.10638    0.07914  -1.344   0.1789  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 325.79  on 235  degrees of freedom
## Residual deviance: 300.88  on 226  degrees of freedom
## AIC: 320.88
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## attr(,"class")
## [1] "summary.DualUplift"

For Control Group transfoot: Negative effect (p = 0.020), indicating employees using foot transportation are less likely to experience the event. score_selfctrl: Negative effect (p = 0.049), suggesting lower self-control scores increase event likelihood. routefriends (p = 0.088) and score_anx (p = 0.068) are borderline significant, hinting at potential effects on the event.
For Treatment Group: routefriends (p = 0.079) is close to significant, indicating this route might influence the event. Other predictors show weaker effects, suggesting little impact in the treatment group.

Insights based on this: transfoot: Employees commuting by foot are significantly more likely to quit in both control and treatment groups. Consider offering support like transportation benefits or flexible working arrangements. score_selfctrl: Low self-control is linked to higher quit rates, especially in the control group. Implement mentoring, goal-setting, or time management programs.

Prof group

We have a hypothesis that employees in different professions may have different retention risks due to job roles, job market demand, or career development opportunities.

We will test if employees in high demand professions versus others to identify where retention strategies might have most uplift.

# Create grouped 'prof' variable
emp_df$prof_group <- with(emp_df, ifelse(prof %in% c("BusinessDevelopment", "manage", "Consult"), "Management & Leadership",
                             ifelse(prof %in% c("Engineer", "IT"), "Technical & Engineering",
                             ifelse(prof %in% c("Sales", "Marketing", "Commercial", "PR"), "Sales, Marketing & Commercial",
                             ifelse(prof %in% c("Accounting", "Finance", "Law"), "Finance & Legal",
                             ifelse(prof == "HR", "HR & Administration",
                             ifelse(prof == "Teaching", "Specialized Roles", "Other")))))))

table(emp_df$prof_group)

## 
##               Finance & Legal           HR & Administration 
##                            34                           757 
##       Management & Leadership                         Other 
##                            74                            37 
## Sales, Marketing & Commercial             Specialized Roles 
##                           126                            12 
##       Technical & Engineering 
##                            89

temp <- data.frame(emp_df)
temp$treatment <- ifelse(temp$prof_group %in% c("Technical & Engineering"), 1, 0)
head(temp)

##         exp event self_gender age       field       prof          route coach
## 1  7.030801     1           m  35     finance         HR      recommend    no
## 2 22.965092     1           m  33     finance         HR jobsite_resume    no
## 3 15.934292     1           f  35      others         HR      recommend    no
## 4 15.934292     1           f  35      others         HR      recommend    no
## 5  8.410678     1           m  32      retail Commercial  jobsite_apply   yes
## 6  8.969199     1           f  42 manufacture         HR jobsite_resume   yes
##   head_gender  wage trans score_extro score_idpt score_selfctrl score_anx
## 1           f white   bus         6.2        4.1            5.7       7.1
## 2           m white   bus         6.2        4.1            5.7       7.1
## 3           m white   bus         6.2        6.2            2.6       4.8
## 4           m white   bus         5.4        7.6            4.9       2.5
## 5           f white   bus         3.0        4.1            8.0       7.1
## 6           m white   bus         6.2        6.2            4.1       5.6
##   score_innov                    prof_group treatment
## 1         8.3           HR & Administration         0
## 2         8.3           HR & Administration         0
## 3         8.3           HR & Administration         0
## 4         6.7           HR & Administration         0
## 5         3.7 Sales, Marketing & Commercial         0
## 6         6.7           HR & Administration         0

# Recreate emp_df_copy to avoid modifying the original data
temp_sub <- temp[, c("treatment", "event", "route", "wage", "trans", "score_selfctrl", "score_anx")]

# Fit the two-model uplift regression
uplift_model <- DualUplift(temp_sub, treat = "treatment", outcome = "event", 
                           predictors = c("route", "wage", "trans", "score_selfctrl", "score_anx"))

print(uplift_model)

## $call
## DualUplift(data = temp_sub, treat = "treatment", outcome = "event", 
##     predictors = c("route", "wage", "trans", "score_selfctrl", 
##         "score_anx"))
## 
## $control
## 
## Call:  "Control Group Logistic Model"
## 
## Coefficients:
##         (Intercept)         routefriends   routejobsite_apply  
##             1.03269             -0.49600             -0.24826  
## routejobsite_resume       routerecommend            wagewhite  
##             0.11722              0.40538             -0.29556  
##            transcar            transfoot       score_selfctrl  
##             0.03696             -0.49857             -0.03805  
##           score_anx  
##            -0.08704  
## 
## Degrees of Freedom: 1039 Total (i.e. Null);  1030 Residual
## Null Deviance:       1441 
## Residual Deviance: 1401  AIC: 1421
## 
## $treatment
## 
## Call:  "Treatment Group Logistic Model"
## 
## Coefficients:
##         (Intercept)         routefriends   routejobsite_apply  
##           -18.68276             15.03385             16.38026  
## routejobsite_resume       routerecommend            wagewhite  
##            16.18549             17.32853              0.40742  
##            transcar            transfoot       score_selfctrl  
##            -0.99190             -2.88253              0.00728  
##           score_anx  
##             0.30835  
## 
## Degrees of Freedom: 88 Total (i.e. Null);  79 Residual
## Null Deviance:       120.1 
## Residual Deviance: 98.93     AIC: 118.9
## 
## attr(,"class")
## [1] "print.DualUplift"

summary(uplift_model)

## $call
## DualUplift(data = temp_sub, treat = "treatment", outcome = "event", 
##     predictors = c("route", "wage", "trans", "score_selfctrl", 
##         "score_anx"))
## 
## $control
## 
## Call:
## "Control Group Logistic Model"
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)  
## (Intercept)          1.03269    0.40116   2.574   0.0100 *
## routefriends        -0.49600    0.29132  -1.703   0.0886 .
## routejobsite_apply  -0.24826    0.24084  -1.031   0.3026  
## routejobsite_resume  0.11722    0.24986   0.469   0.6390  
## routerecommend       0.40538    0.23685   1.712   0.0870 .
## wagewhite           -0.29556    0.20283  -1.457   0.1451  
## transcar             0.03696    0.14485   0.255   0.7986  
## transfoot           -0.49857    0.22227  -2.243   0.0249 *
## score_selfctrl      -0.03805    0.03267  -1.165   0.2442  
## score_anx           -0.08704    0.03792  -2.296   0.0217 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1440.9  on 1039  degrees of freedom
## Residual deviance: 1401.3  on 1030  degrees of freedom
## AIC: 1421.3
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## $treatment
## 
## Call:
## "Treatment Group Logistic Model"
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)  
## (Intercept)          -18.68276 1295.38414  -0.014   0.9885  
## routefriends          15.03385 1295.38305   0.012   0.9907  
## routejobsite_apply    16.38026 1295.38292   0.013   0.9899  
## routejobsite_resume   16.18549 1295.38289   0.012   0.9900  
## routerecommend        17.32853 1295.38295   0.013   0.9893  
## wagewhite              0.40742    0.83845   0.486   0.6270  
## transcar              -0.99190    0.55955  -1.773   0.0763 .
## transfoot             -2.88253    1.17639  -2.450   0.0143 *
## score_selfctrl         0.00728    0.11667   0.062   0.9502  
## score_anx              0.30834    0.16689   1.848   0.0647 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 120.113  on 88  degrees of freedom
## Residual deviance:  98.931  on 79  degrees of freedom
## AIC: 118.93
## 
## Number of Fisher Scoring iterations: 15
## 
## 
## attr(,"class")
## [1] "summary.DualUplift"

Control group:
Employees with higher anxiety are more likely to quit.
Those who walk to work are more likely to quit.
Referral and friendship-based hires show some indication of lower turnover, though not strongly significant.
Treatment group:
Transportation (walking and driving) compared to go by bus seems to have a strong effect on turnover for IT & Engineering professionals.
Anxiety remains a factor in predicting quitting behavior, with marginal significance.

Insights: + Anxiety is a significant factor in both control and treatment groups. Consider interventions like mental health support or stress management programs for employees with high anxiety scores. + In the treatment group, employees who drive to work seem to have lower turnover. You may want to target them for special retention programs or further incentives as they seem less likely to quit. + The control group shows that hiring through friends or recommendations might lead to lower turnover. While the treatment group results for route variables are inconclusive, consider revisiting how IT & Engineering employees are sourced. Referral programs might yield better long-term retention.

chosen_factors = c("field", "prof", "route", "wage", "trans")
unique_val_lst <- list()
for (col in chosen_factors){
  unique_val_lst[[col]] = unique(emp_df[col])
}

for (col in names(unique_val_lst)) {
  print(unique_val_lst[[col]])
}

##          field
## 1      finance
## 3       others
## 5       retail
## 6  manufacture
## 18          IT
##                    prof
## 1                    HR
## 5            Commercial
## 10            Marketing
## 14                  etc
## 18                Sales
## 22  BusinessDevelopment
## 23              Finance
## 53             Teaching
## 68               manage
## 109                  IT
## 121                 Law
## 135             Consult
## 138            Engineer
## 279                  PR
## 356          Accounting
##              route
## 1        recommend
## 2   jobsite_resume
## 5    jobsite_apply
## 14          others
## 313        friends
##     wage
## 1  white
## 19  grey
##    trans
## 1    bus
## 11   car
## 13  foot

Route

We hypothesize that the source through which an employee was recruited might affect their retention. For example, employees recruited through referrals may have a stronger bond with the company than those recruited via job sites. Therefore, we use recruitment source with more focusing on referral as a treatment-like variable to target specific interventions to employees from less effective pipelines.

temp$treatment <- ifelse(temp$route %in% c("friends", "recommend"), 1, 0)
head(temp)

##         exp event self_gender age       field       prof          route coach
## 1  7.030801     1           m  35     finance         HR      recommend    no
## 2 22.965092     1           m  33     finance         HR jobsite_resume    no
## 3 15.934292     1           f  35      others         HR      recommend    no
## 4 15.934292     1           f  35      others         HR      recommend    no
## 5  8.410678     1           m  32      retail Commercial  jobsite_apply   yes
## 6  8.969199     1           f  42 manufacture         HR jobsite_resume   yes
##   head_gender  wage trans score_extro score_idpt score_selfctrl score_anx
## 1           f white   bus         6.2        4.1            5.7       7.1
## 2           m white   bus         6.2        4.1            5.7       7.1
## 3           m white   bus         6.2        6.2            2.6       4.8
## 4           m white   bus         5.4        7.6            4.9       2.5
## 5           f white   bus         3.0        4.1            8.0       7.1
## 6           m white   bus         6.2        6.2            4.1       5.6
##   score_innov                    prof_group treatment
## 1         8.3           HR & Administration         1
## 2         8.3           HR & Administration         0
## 3         8.3           HR & Administration         1
## 4         6.7           HR & Administration         1
## 5         3.7 Sales, Marketing & Commercial         0
## 6         6.7           HR & Administration         0

# Recreate emp_df_copy to avoid modifying the original data
temp_sub <- temp[, c("treatment", "event", "prof_group", "wage", "trans", "score_selfctrl", "score_anx")]

# Fit the two-model uplift regression
uplift_model <- DualUplift(temp_sub, treat = "treatment", outcome = "event", 
                           predictors = c("prof_group", "wage", "trans", "score_selfctrl", "score_anx"))

print(uplift_model)

## $call
## DualUplift(data = temp_sub, treat = "treatment", outcome = "event", 
##     predictors = c("prof_group", "wage", "trans", "score_selfctrl", 
##         "score_anx"))
## 
## $control
## 
## Call:  "Control Group Logistic Model"
## 
## Coefficients:
##                             (Intercept)  
##                                 0.64920  
##           prof_groupHR & Administration  
##                                -0.15084  
##       prof_groupManagement & Leadership  
##                                 0.73692  
##                         prof_groupOther  
##                                 0.07959  
## prof_groupSales, Marketing & Commercial  
##                                 0.60613  
##             prof_groupSpecialized Roles  
##                                14.67774  
##       prof_groupTechnical & Engineering  
##                                -0.41039  
##                               wagewhite  
##                                -0.15335  
##                                transcar  
##                                -0.21582  
##                               transfoot  
##                                -0.80037  
##                          score_selfctrl  
##                                -0.02237  
##                               score_anx  
##                                -0.05317  
## 
## Degrees of Freedom: 665 Total (i.e. Null);  654 Residual
## Null Deviance:       922.3 
## Residual Deviance: 893.7     AIC: 917.7
## 
## $treatment
## 
## Call:  "Treatment Group Logistic Model"
## 
## Coefficients:
##                             (Intercept)  
##                                 2.33796  
##           prof_groupHR & Administration  
##                                -0.97177  
##       prof_groupManagement & Leadership  
##                                -0.36595  
##                         prof_groupOther  
##                                -0.37652  
## prof_groupSales, Marketing & Commercial  
##                                -0.62287  
##             prof_groupSpecialized Roles  
##                                14.55454  
##       prof_groupTechnical & Engineering  
##                                -1.09251  
##                               wagewhite  
##                                -0.43483  
##                                transcar  
##                                 0.25725  
##                               transfoot  
##                                -0.54997  
##                          score_selfctrl  
##                                -0.07255  
##                               score_anx  
##                                -0.10988  
## 
## Degrees of Freedom: 462 Total (i.e. Null);  451 Residual
## Null Deviance:       638.6 
## Residual Deviance: 605.2     AIC: 629.2
## 
## attr(,"class")
## [1] "print.DualUplift"

summary(uplift_model)

## $call
## DualUplift(data = temp_sub, treat = "treatment", outcome = "event", 
##     predictors = c("prof_group", "wage", "trans", "score_selfctrl", 
##         "score_anx"))
## 
## $control
## 
## Call:
## "Control Group Logistic Model"
## 
## Coefficients:
##                                          Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               0.64920    0.85353   0.761  0.44690
## prof_groupHR & Administration            -0.15084    0.71712  -0.210  0.83340
## prof_groupManagement & Leadership         0.73692    0.80484   0.916  0.35987
## prof_groupOther                           0.07959    0.82872   0.096  0.92349
## prof_groupSales, Marketing & Commercial   0.60613    0.75616   0.802  0.42279
## prof_groupSpecialized Roles              14.67774  499.14355   0.029  0.97654
## prof_groupTechnical & Engineering        -0.41039    0.77131  -0.532  0.59468
## wagewhite                                -0.15335    0.23847  -0.643  0.52018
## transcar                                 -0.21582    0.18495  -1.167  0.24324
## transfoot                                -0.80037    0.28325  -2.826  0.00472
## score_selfctrl                           -0.02237    0.04030  -0.555  0.57878
## score_anx                                -0.05317    0.04769  -1.115  0.26488
##                                           
## (Intercept)                               
## prof_groupHR & Administration             
## prof_groupManagement & Leadership         
## prof_groupOther                           
## prof_groupSales, Marketing & Commercial   
## prof_groupSpecialized Roles               
## prof_groupTechnical & Engineering         
## wagewhite                                 
## transcar                                  
## transfoot                               **
## score_selfctrl                            
## score_anx                                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 922.26  on 665  degrees of freedom
## Residual deviance: 893.65  on 654  degrees of freedom
## AIC: 917.65
## 
## Number of Fisher Scoring iterations: 13
## 
## 
## $treatment
## 
## Call:
## "Treatment Group Logistic Model"
## 
## Coefficients:
##                                          Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               2.33796    0.76770   3.045  0.00232
## prof_groupHR & Administration            -0.97177    0.46775  -2.078  0.03775
## prof_groupManagement & Leadership        -0.36595    0.55536  -0.659  0.50993
## prof_groupOther                          -0.37652    0.72293  -0.521  0.60250
## prof_groupSales, Marketing & Commercial  -0.62287    0.53398  -1.166  0.24343
## prof_groupSpecialized Roles              14.55454  477.12566   0.031  0.97566
## prof_groupTechnical & Engineering        -1.09251    0.55299  -1.976  0.04819
## wagewhite                                -0.43483    0.35997  -1.208  0.22706
## transcar                                  0.25725    0.21271   1.209  0.22650
## transfoot                                -0.54997    0.34282  -1.604  0.10866
## score_selfctrl                           -0.07255    0.04982  -1.456  0.14534
## score_anx                                -0.10988    0.05867  -1.873  0.06111
##                                           
## (Intercept)                             **
## prof_groupHR & Administration           * 
## prof_groupManagement & Leadership         
## prof_groupOther                           
## prof_groupSales, Marketing & Commercial   
## prof_groupSpecialized Roles               
## prof_groupTechnical & Engineering       * 
## wagewhite                                 
## transcar                                  
## transfoot                                 
## score_selfctrl                            
## score_anx                               . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 638.57  on 462  degrees of freedom
## Residual deviance: 605.23  on 451  degrees of freedom
## AIC: 629.23
## 
## Number of Fisher Scoring iterations: 14
## 
## 
## attr(,"class")
## [1] "summary.DualUplift"

Control Group:
Walking to work is the most significant predictor of turnover in the control group.
Treatment Group:
Referral-based hires in HR & Administration and Technical & Engineering roles are less likely to quit, suggesting that referrals work well for these professions.
Anxiety may increase the likelihood of quitting, even for referred employees. Insights:
The model shows that HR & Administration and Technical & Engineering employees who were referred are less likely to quit. This suggests that referral programs are particularly effective for these roles. Strengthen and promote referral incentives for these professions to further enhance retention.
For other professions like Sales & Marketing or Management, the referral system may not be as effective
Walking to work is consistently associated with higher turnover in both the control and treatment groups
Anxiety is a marginally significant predictor of quitting, particularly in the treatment group.

Wage Tax Authorized or not (Wage)

We think that employees with unauthorized wages (grey wages) might be less satisfied or less committed due to unstable or unreported income, leading to higher turnover. Therefore, we decided to investigate more on the differential impact of interventions (more transparency in compensation) on employee with grey wages

temp$treatment <- ifelse(temp$wage %in% c("grey"), 1, 0)
head(temp)

##         exp event self_gender age       field       prof          route coach
## 1  7.030801     1           m  35     finance         HR      recommend    no
## 2 22.965092     1           m  33     finance         HR jobsite_resume    no
## 3 15.934292     1           f  35      others         HR      recommend    no
## 4 15.934292     1           f  35      others         HR      recommend    no
## 5  8.410678     1           m  32      retail Commercial  jobsite_apply   yes
## 6  8.969199     1           f  42 manufacture         HR jobsite_resume   yes
##   head_gender  wage trans score_extro score_idpt score_selfctrl score_anx
## 1           f white   bus         6.2        4.1            5.7       7.1
## 2           m white   bus         6.2        4.1            5.7       7.1
## 3           m white   bus         6.2        6.2            2.6       4.8
## 4           m white   bus         5.4        7.6            4.9       2.5
## 5           f white   bus         3.0        4.1            8.0       7.1
## 6           m white   bus         6.2        6.2            4.1       5.6
##   score_innov                    prof_group treatment
## 1         8.3           HR & Administration         0
## 2         8.3           HR & Administration         0
## 3         8.3           HR & Administration         0
## 4         6.7           HR & Administration         0
## 5         3.7 Sales, Marketing & Commercial         0
## 6         6.7           HR & Administration         0

# Recreate emp_df_copy to avoid modifying the original data
temp_sub <- temp[, c("treatment", "event", "prof_group", "route", "trans", "score_selfctrl", "score_anx")]

# Fit the two-model uplift regression
uplift_model <- DualUplift(temp_sub, treat = "treatment", outcome = "event", 
                           predictors = c("prof_group", "route", "trans", "score_selfctrl", "score_anx"))

print(uplift_model)

## $call
## DualUplift(data = temp_sub, treat = "treatment", outcome = "event", 
##     predictors = c("prof_group", "route", "trans", "score_selfctrl", 
##         "score_anx"))
## 
## $control
## 
## Call:  "Control Group Logistic Model"
## 
## Coefficients:
##                             (Intercept)  
##                                 0.99707  
##           prof_groupHR & Administration  
##                                -0.71578  
##       prof_groupManagement & Leadership  
##                                 0.06597  
##                         prof_groupOther  
##                                -0.33491  
## prof_groupSales, Marketing & Commercial  
##                                -0.17960  
##             prof_groupSpecialized Roles  
##                                14.86040  
##       prof_groupTechnical & Engineering  
##                                -0.89308  
##                            routefriends  
##                                -0.38778  
##                      routejobsite_apply  
##                                 0.05805  
##                     routejobsite_resume  
##                                 0.46337  
##                          routerecommend  
##                                 0.59129  
##                                transcar  
##                                 0.05342  
##                               transfoot  
##                                -0.66836  
##                          score_selfctrl  
##                                -0.04570  
##                               score_anx  
##                                -0.06744  
## 
## Degrees of Freedom: 1001 Total (i.e. Null);  987 Residual
## Null Deviance:       1389 
## Residual Deviance: 1317  AIC: 1347
## 
## $treatment
## 
## Call:  "Treatment Group Logistic Model"
## 
## Coefficients:
##                             (Intercept)  
##                               17.473756  
##           prof_groupHR & Administration  
##                              -15.725110  
##       prof_groupManagement & Leadership  
##                              -15.214430  
##                         prof_groupOther  
##                              -16.717607  
## prof_groupSales, Marketing & Commercial  
##                              -14.624807  
##             prof_groupSpecialized Roles  
##                                0.117548  
##       prof_groupTechnical & Engineering  
##                              -16.253123  
##                            routefriends  
##                               -0.017423  
##                      routejobsite_apply  
##                               -0.553182  
##                     routejobsite_resume  
##                               -0.698116  
##                          routerecommend  
##                               -0.143729  
##                                transcar  
##                               -0.841101  
##                               transfoot  
##                               -1.204732  
##                          score_selfctrl  
##                               -0.003691  
##                               score_anx  
##                               -0.153623  
## 
## Degrees of Freedom: 126 Total (i.e. Null);  112 Residual
## Null Deviance:       173.2 
## Residual Deviance: 155.5     AIC: 185.5
## 
## attr(,"class")
## [1] "print.DualUplift"

summary(uplift_model)

## $call
## DualUplift(data = temp_sub, treat = "treatment", outcome = "event", 
##     predictors = c("prof_group", "route", "trans", "score_selfctrl", 
##         "score_anx"))
## 
## $control
## 
## Call:
## "Control Group Logistic Model"
## 
## Coefficients:
##                                          Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               0.99707    0.55491   1.797  0.07237
## prof_groupHR & Administration            -0.71578    0.39378  -1.818  0.06911
## prof_groupManagement & Leadership         0.06597    0.46080   0.143  0.88617
## prof_groupOther                          -0.33491    0.52482  -0.638  0.52338
## prof_groupSales, Marketing & Commercial  -0.17960    0.44020  -0.408  0.68327
## prof_groupSpecialized Roles              14.86040  445.74264   0.033  0.97340
## prof_groupTechnical & Engineering        -0.89308    0.44975  -1.986  0.04706
## routefriends                             -0.38778    0.30564  -1.269  0.20453
## routejobsite_apply                        0.05805    0.26194   0.222  0.82460
## routejobsite_resume                       0.46337    0.27163   1.706  0.08803
## routerecommend                            0.59129    0.25596   2.310  0.02089
## transcar                                  0.05342    0.15112   0.353  0.72374
## transfoot                                -0.66836    0.22580  -2.960  0.00308
## score_selfctrl                           -0.04570    0.03391  -1.348  0.17771
## score_anx                                -0.06744    0.04039  -1.670  0.09496
##                                           
## (Intercept)                             . 
## prof_groupHR & Administration           . 
## prof_groupManagement & Leadership         
## prof_groupOther                           
## prof_groupSales, Marketing & Commercial   
## prof_groupSpecialized Roles               
## prof_groupTechnical & Engineering       * 
## routefriends                              
## routejobsite_apply                        
## routejobsite_resume                     . 
## routerecommend                          * 
## transcar                                  
## transfoot                               **
## score_selfctrl                            
## score_anx                               . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1389.0  on 1001  degrees of freedom
## Residual deviance: 1316.9  on  987  degrees of freedom
## AIC: 1346.9
## 
## Number of Fisher Scoring iterations: 14
## 
## 
## $treatment
## 
## Call:
## "Treatment Group Logistic Model"
## 
## Coefficients:
##                                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)                              1.747e+01  1.692e+03   0.010   0.9918
## prof_groupHR & Administration           -1.573e+01  1.692e+03  -0.009   0.9926
## prof_groupManagement & Leadership       -1.521e+01  1.692e+03  -0.009   0.9928
## prof_groupOther                         -1.672e+01  1.692e+03  -0.010   0.9921
## prof_groupSales, Marketing & Commercial -1.462e+01  1.692e+03  -0.009   0.9931
## prof_groupSpecialized Roles              1.175e-01  2.393e+03   0.000   1.0000
## prof_groupTechnical & Engineering       -1.625e+01  1.692e+03  -0.010   0.9923
## routefriends                            -1.742e-02  1.154e+00  -0.015   0.9880
## routejobsite_apply                      -5.532e-01  8.355e-01  -0.662   0.5079
## routejobsite_resume                     -6.981e-01  8.280e-01  -0.843   0.3992
## routerecommend                          -1.437e-01  8.453e-01  -0.170   0.8650
## transcar                                -8.411e-01  4.528e-01  -1.857   0.0632
## transfoot                               -1.205e+00  2.033e+00  -0.593   0.5534
## score_selfctrl                          -3.691e-03  9.783e-02  -0.038   0.9699
## score_anx                               -1.536e-01  1.080e-01  -1.422   0.1550
##                                          
## (Intercept)                              
## prof_groupHR & Administration            
## prof_groupManagement & Leadership        
## prof_groupOther                          
## prof_groupSales, Marketing & Commercial  
## prof_groupSpecialized Roles              
## prof_groupTechnical & Engineering        
## routefriends                             
## routejobsite_apply                       
## routejobsite_resume                      
## routerecommend                           
## transcar                                .
## transfoot                                
## score_selfctrl                           
## score_anx                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 173.21  on 126  degrees of freedom
## Residual deviance: 155.46  on 112  degrees of freedom
## AIC: 185.46
## 
## Number of Fisher Scoring iterations: 15
## 
## 
## attr(,"class")
## [1] "summary.DualUplift"

Control Group
HR & Administration and Technical & Engineering employees on white wages are less likely to quit.
Employees who were hired through recommendations or jobsite resume submissions are more likely to quit, pointing to potential mismatches in these hiring routes.
Treatment Group
The model shows high instability and sparsity in the treatment group, making it difficult to derive strong conclusions. None of the factors (profession, hiring route, transportation, or psychological traits) appear to significantly affect quitting behavior for grey wage employees, apart from a marginal effect of driving to work.

Mode of transportation (trans)

We think that employees who use more convenient or subsidized transportation may have higher retention due to reduced commuting stress. Thus, we want to assess the impact of interventions like offering transporation benefits on employees with different commuting modes.

table(emp_df$trans)

## 
##  bus  car foot 
##  681  331  117

temp$treatment <- ifelse(temp$trans %in% c("bus"), 1, 0)
head(temp)

##         exp event self_gender age       field       prof          route coach
## 1  7.030801     1           m  35     finance         HR      recommend    no
## 2 22.965092     1           m  33     finance         HR jobsite_resume    no
## 3 15.934292     1           f  35      others         HR      recommend    no
## 4 15.934292     1           f  35      others         HR      recommend    no
## 5  8.410678     1           m  32      retail Commercial  jobsite_apply   yes
## 6  8.969199     1           f  42 manufacture         HR jobsite_resume   yes
##   head_gender  wage trans score_extro score_idpt score_selfctrl score_anx
## 1           f white   bus         6.2        4.1            5.7       7.1
## 2           m white   bus         6.2        4.1            5.7       7.1
## 3           m white   bus         6.2        6.2            2.6       4.8
## 4           m white   bus         5.4        7.6            4.9       2.5
## 5           f white   bus         3.0        4.1            8.0       7.1
## 6           m white   bus         6.2        6.2            4.1       5.6
##   score_innov                    prof_group treatment
## 1         8.3           HR & Administration         1
## 2         8.3           HR & Administration         1
## 3         8.3           HR & Administration         1
## 4         6.7           HR & Administration         1
## 5         3.7 Sales, Marketing & Commercial         1
## 6         6.7           HR & Administration         1

# Recreate emp_df_copy to avoid modifying the original data
temp_sub <- temp[, c("treatment", "event", "prof_group", "route", "wage", "score_selfctrl", "score_anx")]

# Fit the two-model uplift regression
uplift_model <- DualUplift(temp_sub, treat = "treatment", outcome = "event", 
                           predictors = c("prof_group", "route", "wage", "score_selfctrl", "score_anx"))
print(uplift_model)

## $call
## DualUplift(data = temp_sub, treat = "treatment", outcome = "event", 
##     predictors = c("prof_group", "route", "wage", "score_selfctrl", 
##         "score_anx"))
## 
## $control
## 
## Call:  "Control Group Logistic Model"
## 
## Coefficients:
##                             (Intercept)  
##                                 1.90918  
##           prof_groupHR & Administration  
##                                -0.54507  
##       prof_groupManagement & Leadership  
##                                -0.11562  
##                         prof_groupOther  
##                                -0.32317  
## prof_groupSales, Marketing & Commercial  
##                                -0.48353  
##             prof_groupSpecialized Roles  
##                                13.47259  
##       prof_groupTechnical & Engineering  
##                                -1.41488  
##                            routefriends  
##                                 0.16616  
##                      routejobsite_apply  
##                                -0.09732  
##                     routejobsite_resume  
##                                 0.08140  
##                          routerecommend  
##                                 0.49836  
##                               wagewhite  
##                                -0.04847  
##                          score_selfctrl  
##                                -0.14453  
##                               score_anx  
##                                -0.13492  
## 
## Degrees of Freedom: 447 Total (i.e. Null);  434 Residual
## Null Deviance:       620.6 
## Residual Deviance: 586.1     AIC: 614.1
## 
## $treatment
## 
## Call:  "Treatment Group Logistic Model"
## 
## Coefficients:
##                             (Intercept)  
##                                 1.10950  
##           prof_groupHR & Administration  
##                                -1.05165  
##       prof_groupManagement & Leadership  
##                                -0.01037  
##                         prof_groupOther  
##                                -0.79085  
## prof_groupSales, Marketing & Commercial  
##                                -0.01305  
##             prof_groupSpecialized Roles  
##                                14.35006  
##       prof_groupTechnical & Engineering  
##                                -0.85172  
##                            routefriends  
##                                -0.75990  
##                      routejobsite_apply  
##                                 0.08910  
##                     routejobsite_resume  
##                                 0.52726  
##                          routerecommend  
##                                 0.62512  
##                               wagewhite  
##                                -0.34236  
##                          score_selfctrl  
##                                 0.02366  
##                               score_anx  
##                                -0.05141  
## 
## Degrees of Freedom: 680 Total (i.e. Null);  667 Residual
## Null Deviance:       943 
## Residual Deviance: 880   AIC: 908
## 
## attr(,"class")
## [1] "print.DualUplift"

summary(uplift_model)

## $call
## DualUplift(data = temp_sub, treat = "treatment", outcome = "event", 
##     predictors = c("prof_group", "route", "wage", "score_selfctrl", 
##         "score_anx"))
## 
## $control
## 
## Call:
## "Control Group Logistic Model"
## 
## Coefficients:
##                                          Estimate Std. Error z value Pr(>|z|)
## (Intercept)                               1.90918    0.83800   2.278  0.02271
## prof_groupHR & Administration            -0.54507    0.54197  -1.006  0.31455
## prof_groupManagement & Leadership        -0.11562    0.64193  -0.180  0.85707
## prof_groupOther                          -0.32317    0.72894  -0.443  0.65752
## prof_groupSales, Marketing & Commercial  -0.48353    0.59222  -0.816  0.41423
## prof_groupSpecialized Roles              13.47259  508.16307   0.027  0.97885
## prof_groupTechnical & Engineering        -1.41488    0.64658  -2.188  0.02865
## routefriends                              0.16616    0.41949   0.396  0.69204
## routejobsite_apply                       -0.09732    0.34785  -0.280  0.77966
## routejobsite_resume                       0.08140    0.36697   0.222  0.82446
## routerecommend                            0.49836    0.31714   1.571  0.11609
## wagewhite                                -0.04847    0.33670  -0.144  0.88553
## score_selfctrl                           -0.14453    0.05171  -2.795  0.00519
## score_anx                                -0.13492    0.06301  -2.141  0.03226
##                                           
## (Intercept)                             * 
## prof_groupHR & Administration             
## prof_groupManagement & Leadership         
## prof_groupOther                           
## prof_groupSales, Marketing & Commercial   
## prof_groupSpecialized Roles               
## prof_groupTechnical & Engineering       * 
## routefriends                              
## routejobsite_apply                        
## routejobsite_resume                       
## routerecommend                            
## wagewhite                                 
## score_selfctrl                          **
## score_anx                               * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 620.62  on 447  degrees of freedom
## Residual deviance: 586.09  on 434  degrees of freedom
## AIC: 614.09
## 
## Number of Fisher Scoring iterations: 13
## 
## 
## $treatment
## 
## Call:
## "Treatment Group Logistic Model"
## 
## Coefficients:
##                                          Estimate Std. Error z value Pr(>|z|)  
## (Intercept)                               1.10950    0.78923   1.406   0.1598  
## prof_groupHR & Administration            -1.05165    0.56079  -1.875   0.0608 .
## prof_groupManagement & Leadership        -0.01037    0.64594  -0.016   0.9872  
## prof_groupOther                          -0.79085    0.71793  -1.102   0.2706  
## prof_groupSales, Marketing & Commercial  -0.01305    0.62066  -0.021   0.9832  
## prof_groupSpecialized Roles              14.35006  473.31335   0.030   0.9758  
## prof_groupTechnical & Engineering        -0.85172    0.62210  -1.369   0.1710  
## routefriends                             -0.75990    0.43469  -1.748   0.0804 .
## routejobsite_apply                        0.08910    0.37160   0.240   0.8105  
## routejobsite_resume                       0.52726    0.38035   1.386   0.1657  
## routerecommend                            0.62512    0.37808   1.653   0.0982 .
## wagewhite                                -0.34236    0.25427  -1.346   0.1782  
## score_selfctrl                            0.02366    0.04129   0.573   0.5666  
## score_anx                                -0.05141    0.04780  -1.076   0.2821  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 943.00  on 680  degrees of freedom
## Residual deviance: 879.98  on 667  degrees of freedom
## AIC: 907.98
## 
## Number of Fisher Scoring iterations: 14
## 
## 
## attr(,"class")
## [1] "summary.DualUplift"

Control group
Technical & Engineering employees and those with higher self-control or anxiety are less likely to quit among non-bus commuters.
Treatment group
HR & Administration employees who commute by bus are less likely to quit, while those hired through friends are also less likely to quit.
Conversely, bus commuters hired through recommendations are marginally more likely to quit, possibly due to unmet expectations or job mismatches.

Insights: - HR & Administration employees commuting by bus are less likely to quit, indicating stronger job satisfaction or better alignment with organizational values. - Employees with higher self-control and anxiety are less likely to quit among non-bus commuters. - For non-bus commuters, Technical & Engineering employees are less likely to quit. This group may have higher job satisfaction due to better alignment with role expectations. - Bus commuters, particularly those in HR & Administration, appear more stable. Consider expanding transportation-related support (e.g., subsidized bus passes or carpool options) to encourage bus use, as it may correlate with lower turnover.

Summary and Conclusion

Based on comprehensive analysis using various survival models (Kaplan-Meier, AFT, Cox Proportional Hazards, and Uplift models), several key factors influence employee retention, offering insights into potential strategies to reduce turnover.

Key Findings and Recap:

For industry Sector & Profession:

Employees in IT tend to stay longer than those in Finance (2.2x longer), while HR & Administration employees show lower turnover rates across all analyses.
Technical & Engineering roles show a tendency for lower turnover when employees commute by car or bus, indicating potential job satisfaction and alignment with expectations.

Hiring Source (Route to Company):

Employees hired through jobsite applications or resume submissions have a significantly higher risk of quitting (54% to double the risk), suggesting potential issues with job fit or commitment.
Referral programs show strong retention benefits, especially for HR & Administration and Technical & Engineering roles, where referred employees are significantly less likely to quit.

Wage with tax compliance:

Employees receiving white wages (tax-authorized) have a 37.5% lower risk of quitting compared to those receiving untaxed grey wages. This suggests that legal wage structures offer better job security and benefits, which enhance retention.

Mode of transportation:

Employees who walk to work tend to have shorter tenures or higher turnover. In contrast, commuters by bus—especially those in HR & Administration—appear more stable.
Car commuters among Technical & Engineering roles are less likely to quit, making this a group to consider for special retention programs or incentives.

Psychological Scores:

Employees with higher self-control are more likely to stay longer, as each unit increase in self-control decreases the risk of quitting by 6.3%. Programs focusing on self-management, goal-setting, or mentoring may help retain these employees.
Employees with higher anxiety scores are less likely to quit, possibly due to a fear of change or job insecurity. Mental health and stress management programs could further support these employees, helping to maintain or enhance their performance while reducing burnout risks.

Further insights from performing pseduo-experiment with Uplift models:

Referral programs are highly effective for HR & Administration and Technical & Engineering roles but less so for Sales, Marketing, and Management. Strengthening referral incentives for these critical roles could improve long-term retention.
Employees commuting by bus in HR & Administration roles show better job satisfaction and are less likely to quit. Transportation-related benefits like subsidized bus passes could enhance retention for this group.
Employees with higher self-control and anxiety in non-bus commuting groups are also less likely to quit. Understanding the personal circumstances of such employees and offering tailored support could further reduce turnover.

Suggestions and Recommendations:

Referral Programs: Referral hires in Technical & Engineering, HR, and Administration roles are significantly less likely to quit. Expanding and promoting referral programs can lead to longer employee tenures in these critical areas.
Wage Structure: Employees receiving white wages (tax-authorized) are more stable. We should always aim for structuring employees’ wages with tax authorized.
Hiring routes: Employees hired via job sites have significantly higher turnover rates. We should consider improving screening processes, on-boarding, or providing better pre-hire alignment to reduce mismatches. Alternatively, encourage referrals by providing more bonuses to employee throughout careful consideration for long-term retention.
Support Bus or transportation subsidized program: Employees who walk to work tend to quit more often, while bus commuters in HR roles are more stable. We should introduce some transportation benefits, like bus subsidies or carpooling options, could increase stability for commuting employees, especially in HR & Administration.
Provide mental health support: Since higher self-control leads to longer tenures, and higher anxiety is linked with lower turnover, it may affect job performance. We should but some promotions on Mental health programs, including counseling and stress management workshops, should be offered to ensure anxious employees are supported and productive.

Data-Driven Employee Retention: Leveraging Survival and Uplift Modeling for Strategic HR Insights

Quan Nguyen Hien

2024-04-21