Anytime an employee is hired, it is a massive gamble for the company. Ensuring they did the right job, they work feverishly, checking the resume, conducting interview after interview, maybe an assessment or LeetCode challenge to hire an employee, and desperate that they made the right choice out of the other 500 applicants. So much effort has gone into one person. The last thing this company needs is another employee gone, creating a vacancy to fill. So, what is it that makes an employee leave? How can we as a company predict and see what may influence an employee to leave, so before they do, we handle it and keep them satisfied? This data story will walk that path, and in the end, build a machine learning model to predict it in the future.
I aim to identify significant factors that could lead to an employee resigning, as well as paint a picture through data analytics and statistics of the differences between employees who leave and those who stay. In the end, I will develop a machine learning model that can accurately predict employee churn with a low margin of error.
head(df)
## # A tibble: 6 × 13
## left_company department job_level salary weekly_hours business_travel
## <fct> <fct> <fct> <dbl> <dbl> <fct>
## 1 Yes Sales Director 1.19e5 56 Rarely
## 2 No Sales Senior M… 8.56e4 42 Frequently
## 3 Yes Product Development Associate 4.62e4 56 Rarely
## 4 No IT and Analytics Director 1.17e5 50 Frequently
## 5 No Sales Associate 3.66e4 46 Rarely
## 6 No Marketing Senior M… 8.35e4 48 Frequently
## # ℹ 7 more variables: yrs_at_company <int>, yrs_since_promotion <int>,
## # previous_companies <dbl>, job_satisfaction <fct>, performance_rating <fct>,
## # marital_status <fct>, miles_from_home <int>
str(df)
## tibble [1,470 × 13] (S3: tbl_df/tbl/data.frame)
## $ left_company : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ department : Factor w/ 6 levels "Marketing","Sales",..: 2 2 4 5 2 1 1 2 6 4 ...
## $ job_level : Factor w/ 5 levels "Associate","Manager",..: 4 3 1 4 1 3 3 4 3 4 ...
## $ salary : num [1:1470] 118681 85576 46236 117227 36635 ...
## $ weekly_hours : num [1:1470] 56 42 56 50 46 48 44 47 50 51 ...
## $ business_travel : Factor w/ 3 levels "None","Rarely",..: 2 3 2 3 2 3 2 2 3 2 ...
## $ yrs_at_company : int [1:1470] 6 10 0 8 2 7 1 1 9 7 ...
## $ yrs_since_promotion: int [1:1470] 0 1 0 3 2 3 0 0 1 7 ...
## $ previous_companies : num [1:1470] 5 5 6 1 2 1 3 3 3 3 ...
## $ job_satisfaction : Factor w/ 4 levels "Low","Medium",..: 4 2 3 3 2 4 1 3 3 3 ...
## $ performance_rating : Factor w/ 5 levels "Not Effective",..: 3 4 2 5 4 3 3 5 3 4 ...
## $ marital_status : Factor w/ 3 levels "Single","Married",..: 1 2 1 2 2 1 2 3 1 2 ...
## $ miles_from_home : int [1:1470] 1 8 2 3 2 2 3 24 23 27 ...
describe(df)
## df
##
## 13 Variables 1470 Observations
## --------------------------------------------------------------------------------
## left_company
## n missing distinct
## 1470 0 2
##
## Value No Yes
## Frequency 1233 237
## Proportion 0.839 0.161
## --------------------------------------------------------------------------------
## department
## n missing distinct
## 1470 0 6
##
## Value Marketing Sales Research
## Frequency 238 252 293
## Proportion 0.162 0.171 0.199
##
## Value Product Development IT and Analytics Finance and Operations
## Frequency 178 399 110
## Proportion 0.121 0.271 0.075
## --------------------------------------------------------------------------------
## job_level
## n missing distinct
## 1470 0 5
##
## Value Associate Manager Senior Manager Director
## Frequency 185 344 476 331
## Proportion 0.126 0.234 0.324 0.225
##
## Value Vice President
## Frequency 134
## Proportion 0.091
## --------------------------------------------------------------------------------
## salary
## n missing distinct Info Mean Gmd .05 .10
## 1470 0 1470 1 94076 40834 39204 43383
## .25 .50 .75 .90 .95
## 70379 88556 117100 129976 179829
##
## lowest : 29848.6 30488.1 30559.1 32073 32305.9
## highest: 205267 206582 208804 211621 212135
## --------------------------------------------------------------------------------
## weekly_hours
## n missing distinct Info Mean Gmd .05 .10
## 1470 0 27 0.994 50.02 5.252 44 45
## .25 .50 .75 .90 .95
## 47 49 52 58 60
##
## lowest : 40 41 42 43 44, highest: 62 63 64 65 66
## --------------------------------------------------------------------------------
## business_travel
## n missing distinct
## 1470 0 3
##
## Value None Rarely Frequently
## Frequency 150 1043 277
## Proportion 0.102 0.710 0.188
## --------------------------------------------------------------------------------
## yrs_at_company
## n missing distinct Info Mean Gmd .05 .10
## 1470 0 37 0.993 7.008 6.215 1 1
## .25 .50 .75 .90 .95
## 3 5 9 15 20
##
## lowest : 0 1 2 3 4, highest: 33 34 36 37 40
## --------------------------------------------------------------------------------
## yrs_since_promotion
## n missing distinct Info Mean Gmd .05 .10
## 1470 0 16 0.922 2.188 2.994 0 0
## .25 .50 .75 .90 .95
## 0 1 3 7 9
##
## Value 0 1 2 3 4 5 6 7 8 9 10
## Frequency 581 357 159 52 61 45 32 76 18 17 6
## Proportion 0.395 0.243 0.108 0.035 0.041 0.031 0.022 0.052 0.012 0.012 0.004
##
## Value 11 12 13 14 15
## Frequency 24 10 10 9 13
## Proportion 0.016 0.007 0.007 0.006 0.009
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## previous_companies
## n missing distinct Info Mean Gmd
## 1470 0 7 0.93 3.236 1.684
##
## Value 1 2 3 4 5 6 7
## Frequency 136 374 546 98 125 119 72
## Proportion 0.093 0.254 0.371 0.067 0.085 0.081 0.049
##
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## job_satisfaction
## n missing distinct
## 1470 0 4
##
## Value Low Medium High Very High
## Frequency 289 280 442 459
## Proportion 0.197 0.190 0.301 0.312
## --------------------------------------------------------------------------------
## performance_rating
## n missing distinct
## 1470 0 5
##
## Value Not Effective Minimally Effective Meets Expectations
## Frequency 30 136 546
## Proportion 0.020 0.093 0.371
##
## Value Exceeds Expectations Exceptional
## Frequency 472 286
## Proportion 0.321 0.195
## --------------------------------------------------------------------------------
## marital_status
## n missing distinct
## 1470 0 3
##
## Value Single Married Divorced
## Frequency 470 673 327
## Proportion 0.320 0.458 0.222
## --------------------------------------------------------------------------------
## miles_from_home
## n missing distinct Info Mean Gmd .05 .10
## 1470 0 29 0.993 9.193 8.773 1 1
## .25 .50 .75 .90 .95
## 2 7 14 23 26
##
## lowest : 1 2 3 4 5, highest: 25 26 27 28 29
## --------------------------------------------------------------------------------
Our goal is to see what makes up each portion of the population across all categorical columns
Employees who left is around 16% of the population, while this may not be an ideal amount to train and show data, if handled right this can be reflective of the mass population.
tabyl(df,left_company)
## left_company n percent
## No 1233 0.8387755
## Yes 237 0.1612245
Among the departments results seem distributed, but two stand out one being sales being so high at a wopping 31% and one being IT being as low as 8% (I will only show the codechunk for one categorical value for the sake of cleanliness)
tabyl(df,department, left_company) %>%
adorn_totals(c("row", "col")) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 0) %>%
adorn_ns() %>%
adorn_title("combined")
## department/left_company No Yes Total
## Marketing 83% (198) 17% (40) 100% (238)
## Sales 69% (175) 31% (77) 100% (252)
## Research 97% (284) 3% (9) 100% (293)
## Product Development 72% (128) 28% (50) 100% (178)
## IT and Analytics 92% (369) 8% (30) 100% (399)
## Finance and Operations 72% (79) 28% (31) 100% (110)
## Total 84% (1,233) 16% (237) 100% (1,470)
Within the travel table we see a pattern that the more frequent an employee travels the more likely they are to leave.
## business_travel/left_company No Yes Total
## None 92% (138) 8% (12) 100% (150)
## Rarely 85% (887) 15% (156) 100% (1,043)
## Frequently 75% (208) 25% (69) 100% (277)
## Total 84% (1,233) 16% (237) 100% (1,470)
Here, the highest turnover rate is among associates. A theme starts to emerge: in the workforce, we know that sales roles are primarily commission-based, forcing employees to work longer hours to hit quotas. The other two groups that work longer hours are associates, who are new and need more time to grasp concepts, and those who travel frequently, as many of their hours are spent on work-related activities.
## job_level/left_company No Yes Total
## Associate 61% (113) 39% (72) 100% (185)
## Manager 73% (251) 27% (93) 100% (344)
## Senior Manager 94% (447) 6% (29) 100% (476)
## Director 92% (303) 8% (28) 100% (331)
## Vice President 89% (119) 11% (15) 100% (134)
## Total 84% (1,233) 16% (237) 100% (1,470)
Another finding supporting the same idea, that many single people make their work their entire personality, as they do not have much else in life to focus on. So this would come with using work to fill that hole.
## marital_status/left_company No Yes Total
## Single 74% (350) 26% (120) 100% (470)
## Married 88% (589) 12% (84) 100% (673)
## Divorced 90% (294) 10% (33) 100% (327)
## Total 84% (1,233) 16% (237) 100% (1,470)
An idea has formed, but before jumping to conclusions, let’s summarize and group our numerical values to see if there’s a story there
df %>%
group_by(left_company) %>%
summarise(Num_customers = n(),
AvgSalary = mean(salary),
AvgHours = mean(weekly_hours),
AvgCompanies = mean(previous_companies))
## # A tibble: 2 × 5
## left_company Num_customers AvgSalary AvgHours AvgCompanies
## <fct> <int> <dbl> <dbl> <dbl>
## 1 No 1233 97431. 48.4 2.97
## 2 Yes 237 76626. 58.6 4.65
Here we see two key points: first, those who leave on average work 10 more hours than those who stay, indicating a 60-hour work week is nearly all 7 days spent working. Second, the number of previous companies an employee has worked for is also a factor. However, this does not mean previous companies and hours are correlated. Let us dive deeper.
In the below chart, we see our idea come to life. The average hours worked by departments are highest for salespeople who travel frequently.
df %>%
group_by(department, business_travel) %>%
summarise(AvgHours = mean(weekly_hours),)
## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.
## # A tibble: 18 × 3
## # Groups: department [6]
## department business_travel AvgHours
## <fct> <fct> <dbl>
## 1 Marketing None 49.4
## 2 Marketing Rarely 49.9
## 3 Marketing Frequently 50.9
## 4 Sales None 49.2
## 5 Sales Rarely 51.3
## 6 Sales Frequently 53.9
## 7 Research None 49.0
## 8 Research Rarely 48.4
## 9 Research Frequently 48.7
## 10 Product Development None 48.0
## 11 Product Development Rarely 50.8
## 12 Product Development Frequently 52.5
## 13 IT and Analytics None 48.8
## 14 IT and Analytics Rarely 49.3
## 15 IT and Analytics Frequently 49.3
## 16 Finance and Operations None 48.9
## 17 Finance and Operations Rarely 52.0
## 18 Finance and Operations Frequently 53.2
Below, we have essentially converted the above table into a graph. The graph confirms our idea that long hours correlate with high turnover rates across all departments. It is not necessarily the sales department that indicates a high turnover rate; rather, it is the number of hours worked. All the blue points show us that those working 50 or more hours per week tend to leave.
ggplot(df, aes(x = salary, y = weekly_hours, color = left_company)) +
geom_point() +
facet_wrap(~ department)
Below, we see that every employee who leaves is working 50 hours or more per week across each department.
ggplot(df, aes(x= department, y = weekly_hours)) +
geom_boxplot() +
facet_wrap(~ left_company)
Here we see that the proportion of employees who leave is 30% across the three major departments with the longest average working hours.
ggplot(df, aes(x = department, fill = left_company)) +
geom_bar(position = "fill") +
labs(title = "Turnover Rate by Department", y = "Proportion", x = "Department")
Employers often claim that finances are the biggest factor in employee turnover. However, here we see that this idea is false. Employees making as much as $250,000 annually are still leaving due to long working hours.
ggplot(df, aes(x = weekly_hours, y = salary, color = left_company)) +
geom_point() +
facet_wrap(~left_company)
Below we will build a machine learning algorithm using the random forest model.
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.1 ✔ tune 1.2.0
## ✔ infer 1.0.6 ✔ workflows 1.1.4
## ✔ modeldata 1.3.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.2.0 ✔ yardstick 1.3.0
## ✔ recipes 1.0.10
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ Hmisc::src() masks dplyr::src()
## ✖ recipes::step() masks stats::step()
## ✖ Hmisc::summarize() masks dplyr::summarize()
## ✖ parsnip::translate() masks Hmisc::translate()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
library(vip)
##
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
##
## vi
library(rpart.plot)
## Loading required package: rpart
##
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
##
## prune
library(ranger)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ranger':
##
## importance
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(314)
df_split <- initial_split(df, prop = .8,
strata = left_company)
df_training <- df_split %>%
training()
df_testing <- df_split %>%
testing()
df_folds <- vfold_cv(df_training, v = 5)
df_recipe <- recipe(left_company ~ ., data = df_training) %>%
step_YeoJohnson(all_numeric(), -all_outcomes()) %>%
step_normalize(all_numeric(), -all_outcomes()) %>%
step_dummy(all_nominal(), -all_outcomes())
df_recipe %>%
prep() %>%
bake(new_data = df_training)
## # A tibble: 1,175 × 27
## salary weekly_hours yrs_at_company yrs_since_promotion previous_companies
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.720 0.186 0.485 0.929 -1.92
## 2 -0.142 -0.290 0.322 0.929 -1.92
## 3 0.00246 -1.49 -1.46 -1.09 0.0651
## 4 0.832 -0.556 -1.46 -1.09 0.0651
## 5 0.168 0.186 0.632 0.123 0.0651
## 6 0.844 -0.844 -0.0670 -1.09 -1.92
## 7 -0.579 -0.290 0.632 -1.09 0.0651
## 8 -1.70 -0.290 -0.0670 1.13 -0.745
## 9 0.723 -0.290 -0.963 0.123 -0.745
## 10 -0.423 -0.556 0.766 1.54 -0.745
## # ℹ 1,165 more rows
## # ℹ 22 more variables: miles_from_home <dbl>, left_company <fct>,
## # department_Sales <dbl>, department_Research <dbl>,
## # department_Product.Development <dbl>, department_IT.and.Analytics <dbl>,
## # department_Finance.and.Operations <dbl>, job_level_Manager <dbl>,
## # job_level_Senior.Manager <dbl>, job_level_Director <dbl>,
## # job_level_Vice.President <dbl>, business_travel_Rarely <dbl>, …
rf_model <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine('ranger', importance = "impurity") %>%
set_mode('classification')
rf_model
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = tune()
## min_n = tune()
##
## Engine-Specific Arguments:
## importance = impurity
##
## Computational engine: ranger
rf_workflow <- workflow() %>%
add_model(rf_model) %>%
add_recipe(df_recipe)
rf_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## • step_YeoJohnson()
## • step_normalize()
## • step_dummy()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = tune()
## min_n = tune()
##
## Engine-Specific Arguments:
## importance = impurity
##
## Computational engine: ranger
set.seed(314)
rf_grid <- grid_random(mtry() %>% range_set(c(4, 12)),
trees(),
min_n(),
size = 10)
rf_tuning <- rf_workflow %>%
tune_grid(resamples = df_folds,
grid = rf_grid)
rf_tuning
## # Tuning results
## # 5-fold cross-validation
## # A tibble: 5 × 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [940/235]> Fold1 <tibble [30 × 7]> <tibble [0 × 3]>
## 2 <split [940/235]> Fold2 <tibble [30 × 7]> <tibble [0 × 3]>
## 3 <split [940/235]> Fold3 <tibble [30 × 7]> <tibble [0 × 3]>
## 4 <split [940/235]> Fold4 <tibble [30 × 7]> <tibble [0 × 3]>
## 5 <split [940/235]> Fold5 <tibble [30 × 7]> <tibble [0 × 3]>
rf_tuning %>% show_best(metric = 'roc_auc')
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 5 678 16 roc_auc binary 0.996 5 0.00137 Preprocessor1_Model05
## 2 6 1235 6 roc_auc binary 0.996 5 0.00132 Preprocessor1_Model03
## 3 8 1387 10 roc_auc binary 0.996 5 0.00121 Preprocessor1_Model10
## 4 9 138 14 roc_auc binary 0.996 5 0.00133 Preprocessor1_Model06
## 5 6 1841 24 roc_auc binary 0.996 5 0.00128 Preprocessor1_Model01
best_rf <- rf_tuning %>%
select_best(metric = 'roc_auc')
best_rf
## # A tibble: 1 × 4
## mtry trees min_n .config
## <int> <int> <int> <chr>
## 1 5 678 16 Preprocessor1_Model05
final_rf_workflow <- rf_workflow %>%
finalize_workflow(best_rf)
final_rf_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## • step_YeoJohnson()
## • step_normalize()
## • step_dummy()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = 5
## trees = 678
## min_n = 16
##
## Engine-Specific Arguments:
## importance = impurity
##
## Computational engine: ranger
rf_wf_fit <- final_rf_workflow %>%
fit(data = df_training)
rf_fit <- rf_wf_fit %>%
pull_workflow_fit()
## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## ℹ Please use `extract_fit_parsnip()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
rf_fit
## parsnip model object
##
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~5L, x), num.trees = ~678L, min.node.size = min_rows(~16L, x), importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 678
## Sample size: 1175
## Number of independent variables: 26
## Mtry: 5
## Target node size: 16
## Variable importance mode: impurity
## Splitrule: gini
## OOB prediction error (Brier s.): 0.02159441
Ultimalty we are correct as we see here the highest indicator for a employee leaving is weekly hours
vip(rf_fit)
rf_last_fit <- final_rf_workflow %>%
last_fit(df_split)
rf_predictions <- rf_last_fit %>%
collect_predictions()
rf_predictions
## # A tibble: 295 × 7
## .pred_class .pred_No .pred_Yes id .row left_company .config
## <fct> <dbl> <dbl> <chr> <int> <fct> <chr>
## 1 No 0.917 0.0834 train/test split 2 No Preproces…
## 2 No 0.947 0.0533 train/test split 5 No Preproces…
## 3 No 0.983 0.0167 train/test split 10 No Preproces…
## 4 No 0.999 0.00120 train/test split 29 No Preproces…
## 5 No 0.693 0.307 train/test split 38 No Preproces…
## 6 No 0.901 0.0989 train/test split 40 No Preproces…
## 7 No 0.998 0.00207 train/test split 53 No Preproces…
## 8 No 0.995 0.00548 train/test split 54 No Preproces…
## 9 No 0.875 0.125 train/test split 56 No Preproces…
## 10 No 0.998 0.00164 train/test split 57 No Preproces…
## # ℹ 285 more rows
rf_last_fit %>%
collect_metrics()
## # A tibble: 3 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.973 Preprocessor1_Model1
## 2 roc_auc binary 0.996 Preprocessor1_Model1
## 3 brier_class binary 0.0230 Preprocessor1_Model1
rf_predictions %>%
roc_curve(truth = left_company, .pred_Yes) %>%
autoplot()
Creating a conf matric to see how many it gets correct we see that this model has a accruacy rate of 98% effectivly predicting pssible employee churm 98% of the time.
conf_mat(rf_predictions, truth = left_company, estimate = .pred_class)
## Truth
## Prediction No Yes
## No 243 4
## Yes 4 44
Within this new day and age of work, especially post-COVID-2020, we see a massive shift in how people view their jobs. Work is no longer the sole focus of their lives. Many people online express that they would take a lower salary to stay home and work rather than return to the office. In-office work comes with commute times, lunch breaks, and added exhaustion.
The overall recommendation is to look at the hours and find a way to decrease them, possibly by hiring more workers to spread out the work distribution evenly and decrease the overall workload. Employees gravitate towards where they feel stability and the ability to go home and relax. If their job becomes the primary focus of their lives, they are likely to leave. Therefore, the company should look into decreasing hours by whatever method possible.