library(dplyr)
library(DBI)
library(plotly)
library(caret)
library(partykit)
library(parsnip)
library(workflows)
library(yardstick)
library(vip)
library(crosstalk)
library(xgboost)
library(randomForest)
library(rsconnect)

Project Aim

Identify the factors influencing an employee’s departure from the company in order to identify weaknesses in corporate policy that can potentially be improved.

Analysis

Data and Analysis Logic

Connecting to the database.

The relationship between the employee’s decision to leave/stay (Attrition) and the position held (JobRole) was considered. For further analysis, those categories where the percentage of employee outflow exceeded 17% were selected - namely representatives of positions such as ‘Human Resources’, ‘Sales Representative’, ‘Laboratory Technician’, ’Sales Executive’, ‘Research Scientist’.

This category was chosen for the analysis, as it shows the division between different positions. In a number of them (‘Manager’, ‘Research Director’, ‘Manufacturing Director’, ‘Healthcare Representative’), the percentage of outflow is relatively lower and does not exceed 10%. This may indicate the presence of some factors that influence the desire of employees to leave their positions in the analyzed categories.

# who stay
stay = dbGetQuery(con, "SELECT JobRole, COUNT(*) AS stay
                         FROM employee.portfolio
                         WHERE Attrition = '0'
                         GROUP BY JobRole")
# who leave
leave = dbGetQuery(con, "SELECT JobRole, COUNT(*) AS leave
                         FROM employee.portfolio
                         WHERE Attrition = '1'
                         GROUP BY JobRole")

temp = stay %>% inner_join(leave, by = "JobRole") %>% mutate(prop = round(leave/(stay + leave), 2)) %>% arrange(-prop)
temp %>% knitr::kable(col.names = c("Job Role", "Stay", "Leave", "Share") )
Job Role Stay Leave Share
Sales Representative 33 25 0.43
Laboratory Technician 139 49 0.26
Human Resources 24 6 0.20
Research Scientist 176 38 0.18
Sales Executive 193 39 0.17
Healthcare Representative 80 8 0.09
Manufacturing Director 100 7 0.07
Manager 74 5 0.06
Research Director 60 2 0.03

The graph also depicts how the shares of employees who stayed or left their jobs were distributed within the positions.

fig = plot_ly(
  labels = c("Total", 
             "Sales Representative", "Laboratory Technician",  "Human Resources", "Research Scientist", "Sales Executive", "Healthcare Representative", "Manufacturing Director", "Manager", "Research Director", 
             "Stay", "Leave", "Stay", "Leave", "Stay", "Leave", "Stay", "Leave", "Stay", "Leave", "Stay", "Leave", "Stay", "Leave", "Stay", "Leave", "Stay", "Leave"),
  parents = c("", 
              "Total", "Total", "Total", "Total", "Total", "Total", "Total", "Total", "Total", 
              "Sales Representative", "Sales Representative", "Laboratory Technician", "Laboratory Technician", "Human Resources", "Human Resources", "Research Scientist", "Research Scientist", "Sales Executive", "Sales Executive", "Healthcare Representative", "Healthcare Representative", "Manufacturing Director", "Manufacturing Director", "Manager", "Manager", "Research Director", "Research Director"),
  values = c(sum(temp$stay)+sum(temp$leave), 
             temp$stay[temp$JobRole == "Sales Representative"]+temp$leave[temp$JobRole == "Sales Representative"],
             temp$stay[temp$JobRole == "Laboratory Technician"]+temp$leave[temp$JobRole == "Laboratory Technician"],
             temp$stay[temp$JobRole == "Human Resources"]+temp$leave[temp$JobRole == "Human Resources"],
             temp$stay[temp$JobRole == "Research Scientist"]+temp$leave[temp$JobRole == "Research Scientist"],
             temp$stay[temp$JobRole == "Sales Executive"]+temp$leave[temp$JobRole == "Sales Executive"],
             temp$stay[temp$JobRole == "Healthcare Representative"]+temp$leave[temp$JobRole == "Healthcare Representative"],
             temp$stay[temp$JobRole == "Manufacturing Director"]+temp$leave[temp$JobRole == "Manufacturing Director"],
             temp$stay[temp$JobRole == "Manager"]+temp$leave[temp$JobRole == "Manager"],
             temp$stay[temp$JobRole == "Research Director"]+temp$leave[temp$JobRole == "Research Director"],
             temp$stay[temp$JobRole == "Sales Representative"],
             temp$leave[temp$JobRole == "Sales Representative"],
             temp$stay[temp$JobRole == "Laboratory Technician"],
             temp$leave[temp$JobRole == "Laboratory Technician"],
             temp$stay[temp$JobRole == "Human Resources"],
             temp$leave[temp$JobRole == "Human Resources"],
             temp$stay[temp$JobRole == "Research Scientist"],
             temp$leave[temp$JobRole == "Research Scientist"],
             temp$stay[temp$JobRole == "Sales Executive"],
             temp$leave[temp$JobRole == "Sales Executive"],
             temp$stay[temp$JobRole == "Healthcare Representative"],
             temp$leave[temp$JobRole == "Healthcare Representative"],
             temp$stay[temp$JobRole == "Manufacturing Director"],
             temp$leave[temp$JobRole == "Manufacturing Director"],
             temp$stay[temp$JobRole == "Manager"],
             temp$leave[temp$JobRole == "Manager"],
             temp$stay[temp$JobRole == "Research Director"],
             temp$leave[temp$JobRole == "Research Director"]),
  branchvalues = 'total',
  type = 'sunburst'
)
fig

The data has been downloaded from the database. Combined by ‘employeeNumber’ for each employee. Filtered by position. The text variables, as well as the dependent variable (Attrition), have been converted to factor variables.

data = dbGetQuery(con, "SELECT Attrition, BusinessTravel, Department, EnvironmentSatisfaction, JobInvolvement, JobSatisfaction, MonthlyIncome, OverTime, PercentSalaryHike, PerformanceRating, RelationshipSatisfaction, TrainingTimesLastYear, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager, Age, Education, EduFieldId, Gender, MaritalStatus, NumCompaniesWorked, TotalWorkingYears, WorkLifeBalance, DistanceFromHome, JobRole FROM employee.portfolio t1
           INNER JOIN employee.profile t2 ON t1.EmployeeNumber = t2.EmployeeNumber
           WHERE JobRole = 'Human Resources' OR JobRole = 'Sales Representative' OR JobRole = 'Laboratory Technician' OR JobRole = 'Sales Executive' OR JobRole = 'Research Scientist'")

data = data %>% mutate_if(is.character, as.factor)
data$Attrition = as.factor(data$Attrition)

dbDisconnect(con)

Model

To further build the model, the data was divided into training and test samples.

set.seed(123)
data.prt = createDataPartition(data$Attrition, p = 0.8, list = F, times = 1)
train = data[data.prt,]
test = data[-data.prt,]

Three models were built to predict the target variable - whether an employee left or stayed in the company. The quality of their prediction is also assessed. The results of the accuracy of model predictions on training and test samples, respectively, are presented.

1. Decision Tree

treemodel = ctree(Attrition~., data = train)

predTrain = predict(treemodel, train)
predTest = predict(treemodel, test)

accuracyTrain.tr =  
  accuracy_vec(train$Attrition, predTrain)
accuracyTest.tr =  
  accuracy_vec(test$Attrition, predTest)

table.tree = rbind(accuracyTrain.tr, accuracyTest.tr)
colnames(table.tree) = c("Accuracy")
rownames(table.tree) = c("Training Sample", "Test Sample")
table.tree %>% knitr::kable()
Accuracy
Training Sample 0.8044983
Test Sample 0.7777778

2. Gradient Boosting

set.seed(3)
xgb = boost_tree(mode = "classification", mtry = 3) %>% 
  set_engine('xgboost')

wf_xgb = workflow() %>% 
  add_model(xgb) %>% 
  add_formula(Attrition~.) %>% 
  fit(train)

predtrain.xgb = predict(wf_xgb, train) 
predtest.xgb = predict(wf_xgb, test)

accuracyTrain.xgb =  
  accuracy_vec(train$Attrition, predtrain.xgb$.pred_class)

accuracyTest.xgb =  
  accuracy_vec(test$Attrition, predtest.xgb$.pred_class)

table.xgb = rbind(accuracyTrain.xgb, accuracyTest.xgb)
colnames(table.xgb) = c("Accuracy")
rownames(table.xgb) = c("Training Sample", "Test Sample")
table.xgb %>% knitr::kable()
Accuracy
Training Sample 0.9013841
Test Sample 0.7986111

3. Random Forest

rf = rand_forest(mode = "classification", trees = 5, mtry = 2) %>% 
  set_engine('randomForest')

wf_rf = workflow() %>% 
  add_model(rf) %>% 
  add_formula(Attrition~.) %>% 
  fit(train)

predtrain.rf = predict(wf_rf, train)
predtest.rf = predict(wf_rf, test)

accuracyTrain.rf =  
  accuracy_vec(train$Attrition, predtrain.rf$.pred_class)

accuracyTest.rf =  
  accuracy_vec(test$Attrition, predtest.rf$.pred_class)

table.rf = rbind(accuracyTrain.rf, accuracyTest.rf)
colnames(table.rf) = c("Accuracy")
rownames(table.rf) = c("Training Sample", "Test Sample")
table.rf %>% knitr::kable()
Accuracy
Training Sample 0.9705882
Test Sample 0.7361111

4. Logistic regression

logreg <- logistic_reg()

log.wf <- workflow() %>%
   add_formula(Attrition ~.) %>%
   add_model(logreg) %>%
   fit(train)

predtrain.logreg  = log.wf %>% predict(train)

accuracyTrain.logreg = predtrain.logreg %>% 
   cbind(train) %>%  
   accuracy(Attrition, .pred_class) %>% 
   dplyr::select(.estimate)

predtest.logreg  = log.wf %>% predict(test)

accuracyTest.logreg = predtest.logreg %>% 
    cbind(test) %>% 
    accuracy(Attrition, .pred_class) %>% 
    dplyr::select(.estimate)

table.log = rbind(accuracyTrain.logreg$.estimate, accuracyTest.logreg$.estimate)
colnames(table.log) = c("Accuracy")
rownames(table.log) = c("Training Sample", "Test Sample")
table.log %>% knitr::kable()
Accuracy
Training Sample 0.8892734
Test Sample 0.8402778

As the evaluation of the quality of model predictions shows, the gradient boost model and the random forest model have been retrained, which is confirmed by the large difference in accuracy between the test and training samples. The decision tree model and the logistic regression model lack this disadvantage, since their accuracy indices on the test and training samples is almost identical. For further work, the logistic regression model was chosen, as its index is higher.

Simulation

In order to propose changes, let’s identify the factors that have the greatest impact on an employee’s decision to leave the company according to the logistic regression model.

vip(log.wf)

The three most influential are:
- Availability of recycling
- Family status - single
- The number of years since the last promotion.

In the graph below, you can see how parameters such as the time since the last promotion, marital status, and monthly income affect the proportion of employees who left/or remained in the company, depending on their position.

data = data %>% mutate(LastPromFactor = ifelse(YearsSinceLastPromotion >= 5, "More than 5", "Less than 5")) %>% mutate(Attrition.name = ifelse(Attrition == 1, "Leave", "Stay"))
data$LastPromFactor = as.factor(data$LastPromFactor)
data$Attrition.name = as.factor(data$Attrition.name)
sharedData <- SharedData$new(data)


bscols(widths = c(3,NA),
       list(filter_checkbox("last_prom", "Years since the last promotion", sharedData , ~LastPromFactor),
            filter_checkbox("overtime", "Overtime work", sharedData , ~OverTime),
            filter_select("marital_status", "Marital status", sharedData, ~MaritalStatus),
            filter_slider("charges", "Monthly income", sharedData, ~MonthlyIncome, max = 13872)),
       plot_ly(sharedData, x = ~MonthlyIncome, y = ~JobRole, color = ~Attrition.name, colors = c("red", "blue"), alpha = 0.5, type = "bar"))

Since the campaign cannot affect the family status, two other indicators will be offered for simulation.

Let’s assume that we have redistributed the workload by reducing overwork for 20% of employees. Let’s simulate this: in 20% of cases that initially stated recycling, we will change this indicator to ‘no’. Let’s compare the shares of employees who stayed/left.

test2 = test
test2$OverTime[test2$OverTime == "Yes"] = 
  sample(c("Yes", "No"), 
         size = length(test2$OverTime[test2$OverTime == "Yes"]),
         replace = T, prob = c(0.8, 0.2))

predTest = predict(log.wf, test2)

ggplot() + geom_bar(data = data.frame(predTest), aes(x = predTest$.pred_class), alpha = 0.5, fill = "red") + geom_bar(data = predtest.logreg, aes(x = predtest.logreg$.pred_class), alpha = 0.5)+
  ggtitle("Comparison of the shares of employees who signed up/left after the simulation")+
  xlab("")+
  ylab("Quantity")+
  scale_x_discrete(labels = c("Stay", "Leave"))+
  theme_minimal()

As we can see, the changes are minor, but they are present. If we shift the proportion more strongly in our simulation, the result will be more noticeable.

Let’s assume that we decide to promote people who have been promoted for more than five years. Let’s simulate this: for employees who have more than 5 years since their last promotion, replace it with 0, assuming that they have received a promotion. Let’s compare the shares of users who stayed/left.

test3 = test
test3$YearsSinceLastPromotion[test3$YearsSinceLastPromotion >=5] = 0
test3$YearsSinceLastPromotion = as.integer(test3$YearsSinceLastPromotion)

predTest = predict(log.wf, test3)


ggplot() + geom_bar(data = data.frame(predTest), aes(x = predTest$.pred_class), alpha = 0.5, fill = "red") + geom_bar(data = predtest.logreg, aes(x = predtest.logreg$.pred_class), alpha = 0.5)+
  ggtitle("Comparison of the shares of employees who signed up/left after the simulation")+
  xlab("")+
  ylab("Quantity")+
  scale_x_discrete(labels = c("Stay", "Leave"))+
  theme_minimal()

Promotion has an impact on the desire of employees to keep their place in the company, but it is weak.

Conclusion

During the analysis, it was revealed that there is a difference in the percentage of employee outflow between positions. Categories with an outflow of more than 17% were considered. According to the logistic regression model, the three most influential factors determining an employee’s desire to leave the company are: overworking, marital status, and the number of years since the last promotion. During the simulation, it was found out that if we redistribute the workload between employees, which would free 20% of employees from overwork, it would contribute to a slight decrease in the proportion of those who want to leave the company. The same effect was observed when promoting people who had been promoted for more than five years. This suggests that, probably, such changes in real life would lead to a greater expenditure of resources than to a real effect. In this regard, I recommend continuing further analysis in order to find the most influential factors.