End Goal

I aim to identify significant factors that could lead to an employee resigning, as well as paint a picture through data analytics and statistics of the differences between employees who leave and those who stay. In the end, I will develop a machine learning model that can accurately predict employee churn with a low margin of error.

Dataset Analysis

head(df)

## # A tibble: 6 × 13
##   left_company department          job_level salary weekly_hours business_travel
##   <fct>        <fct>               <fct>      <dbl>        <dbl> <fct>          
## 1 Yes          Sales               Director  1.19e5           56 Rarely         
## 2 No           Sales               Senior M… 8.56e4           42 Frequently     
## 3 Yes          Product Development Associate 4.62e4           56 Rarely         
## 4 No           IT and Analytics    Director  1.17e5           50 Frequently     
## 5 No           Sales               Associate 3.66e4           46 Rarely         
## 6 No           Marketing           Senior M… 8.35e4           48 Frequently     
## # ℹ 7 more variables: yrs_at_company <int>, yrs_since_promotion <int>,
## #   previous_companies <dbl>, job_satisfaction <fct>, performance_rating <fct>,
## #   marital_status <fct>, miles_from_home <int>

str(df)

## tibble [1,470 × 13] (S3: tbl_df/tbl/data.frame)
##  $ left_company       : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ department         : Factor w/ 6 levels "Marketing","Sales",..: 2 2 4 5 2 1 1 2 6 4 ...
##  $ job_level          : Factor w/ 5 levels "Associate","Manager",..: 4 3 1 4 1 3 3 4 3 4 ...
##  $ salary             : num [1:1470] 118681 85576 46236 117227 36635 ...
##  $ weekly_hours       : num [1:1470] 56 42 56 50 46 48 44 47 50 51 ...
##  $ business_travel    : Factor w/ 3 levels "None","Rarely",..: 2 3 2 3 2 3 2 2 3 2 ...
##  $ yrs_at_company     : int [1:1470] 6 10 0 8 2 7 1 1 9 7 ...
##  $ yrs_since_promotion: int [1:1470] 0 1 0 3 2 3 0 0 1 7 ...
##  $ previous_companies : num [1:1470] 5 5 6 1 2 1 3 3 3 3 ...
##  $ job_satisfaction   : Factor w/ 4 levels "Low","Medium",..: 4 2 3 3 2 4 1 3 3 3 ...
##  $ performance_rating : Factor w/ 5 levels "Not Effective",..: 3 4 2 5 4 3 3 5 3 4 ...
##  $ marital_status     : Factor w/ 3 levels "Single","Married",..: 1 2 1 2 2 1 2 3 1 2 ...
##  $ miles_from_home    : int [1:1470] 1 8 2 3 2 2 3 24 23 27 ...

describe(df)

## df 
## 
##  13  Variables      1470  Observations
## --------------------------------------------------------------------------------
## left_company 
##        n  missing distinct 
##     1470        0        2 
##                       
## Value         No   Yes
## Frequency   1233   237
## Proportion 0.839 0.161
## --------------------------------------------------------------------------------
## department 
##        n  missing distinct 
##     1470        0        6 
##                                                                                
## Value                   Marketing                  Sales               Research
## Frequency                     238                    252                    293
## Proportion                  0.162                  0.171                  0.199
##                                                                                
## Value         Product Development       IT and Analytics Finance and Operations
## Frequency                     178                    399                    110
## Proportion                  0.121                  0.271                  0.075
## --------------------------------------------------------------------------------
## job_level 
##        n  missing distinct 
##     1470        0        5 
##                                                                       
## Value           Associate        Manager Senior Manager       Director
## Frequency             185            344            476            331
## Proportion          0.126          0.234          0.324          0.225
##                          
## Value      Vice President
## Frequency             134
## Proportion          0.091
## --------------------------------------------------------------------------------
## salary 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1470        0     1470        1    94076    40834    39204    43383 
##      .25      .50      .75      .90      .95 
##    70379    88556   117100   129976   179829 
## 
## lowest : 29848.6 30488.1 30559.1 32073   32305.9
## highest: 205267  206582  208804  211621  212135 
## --------------------------------------------------------------------------------
## weekly_hours 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1470        0       27    0.994    50.02    5.252       44       45 
##      .25      .50      .75      .90      .95 
##       47       49       52       58       60 
## 
## lowest : 40 41 42 43 44, highest: 62 63 64 65 66
## --------------------------------------------------------------------------------
## business_travel 
##        n  missing distinct 
##     1470        0        3 
##                                            
## Value            None     Rarely Frequently
## Frequency         150       1043        277
## Proportion      0.102      0.710      0.188
## --------------------------------------------------------------------------------
## yrs_at_company 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1470        0       37    0.993    7.008    6.215        1        1 
##      .25      .50      .75      .90      .95 
##        3        5        9       15       20 
## 
## lowest :  0  1  2  3  4, highest: 33 34 36 37 40
## --------------------------------------------------------------------------------
## yrs_since_promotion 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1470        0       16    0.922    2.188    2.994        0        0 
##      .25      .50      .75      .90      .95 
##        0        1        3        7        9 
##                                                                             
## Value          0     1     2     3     4     5     6     7     8     9    10
## Frequency    581   357   159    52    61    45    32    76    18    17     6
## Proportion 0.395 0.243 0.108 0.035 0.041 0.031 0.022 0.052 0.012 0.012 0.004
##                                         
## Value         11    12    13    14    15
## Frequency     24    10    10     9    13
## Proportion 0.016 0.007 0.007 0.006 0.009
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## previous_companies 
##        n  missing distinct     Info     Mean      Gmd 
##     1470        0        7     0.93    3.236    1.684 
##                                                     
## Value          1     2     3     4     5     6     7
## Frequency    136   374   546    98   125   119    72
## Proportion 0.093 0.254 0.371 0.067 0.085 0.081 0.049
## 
## For the frequency table, variable is rounded to the nearest 0
## --------------------------------------------------------------------------------
## job_satisfaction 
##        n  missing distinct 
##     1470        0        4 
##                                                   
## Value            Low    Medium      High Very High
## Frequency        289       280       442       459
## Proportion     0.197     0.190     0.301     0.312
## --------------------------------------------------------------------------------
## performance_rating 
##        n  missing distinct 
##     1470        0        5 
##                                                                          
## Value             Not Effective  Minimally Effective   Meets Expectations
## Frequency                    30                  136                  546
## Proportion                0.020                0.093                0.371
##                                                     
## Value      Exceeds Expectations          Exceptional
## Frequency                   472                  286
## Proportion                0.321                0.195
## --------------------------------------------------------------------------------
## marital_status 
##        n  missing distinct 
##     1470        0        3 
##                                      
## Value        Single  Married Divorced
## Frequency       470      673      327
## Proportion    0.320    0.458    0.222
## --------------------------------------------------------------------------------
## miles_from_home 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1470        0       29    0.993    9.193    8.773        1        1 
##      .25      .50      .75      .90      .95 
##        2        7       14       23       26 
## 
## lowest :  1  2  3  4  5, highest: 25 26 27 28 29
## --------------------------------------------------------------------------------

Descriptive Statistics

Our goal is to see what makes up each portion of the population across all categorical columns

Employees who left is around 16% of the population, while this may not be an ideal amount to train and show data, if handled right this can be reflective of the mass population.

tabyl(df,left_company)

##  left_company    n   percent
##            No 1233 0.8387755
##           Yes  237 0.1612245

Among the departments results seem distributed, but two stand out one being sales being so high at a wopping 31% and one being IT being as low as 8% (I will only show the codechunk for one categorical value for the sake of cleanliness)

tabyl(df,department, left_company) %>% 
  adorn_totals(c("row", "col")) %>%
  adorn_percentages("row") %>% 
  adorn_pct_formatting(digits = 0) %>%
  adorn_ns() %>%
  adorn_title("combined")

##  department/left_company          No       Yes        Total
##                Marketing 83%   (198) 17%  (40) 100%   (238)
##                    Sales 69%   (175) 31%  (77) 100%   (252)
##                 Research 97%   (284)  3%   (9) 100%   (293)
##      Product Development 72%   (128) 28%  (50) 100%   (178)
##         IT and Analytics 92%   (369)  8%  (30) 100%   (399)
##   Finance and Operations 72%    (79) 28%  (31) 100%   (110)
##                    Total 84% (1,233) 16% (237) 100% (1,470)

Within the travel table we see a pattern that the more frequent an employee travels the more likely they are to leave.

##  business_travel/left_company          No       Yes        Total
##                          None 92%   (138)  8%  (12) 100%   (150)
##                        Rarely 85%   (887) 15% (156) 100% (1,043)
##                    Frequently 75%   (208) 25%  (69) 100%   (277)
##                         Total 84% (1,233) 16% (237) 100% (1,470)

Here, the highest turnover rate is among associates. A theme starts to emerge: in the workforce, we know that sales roles are primarily commission-based, forcing employees to work longer hours to hit quotas. The other two groups that work longer hours are associates, who are new and need more time to grasp concepts, and those who travel frequently, as many of their hours are spent on work-related activities.

##  job_level/left_company          No       Yes        Total
##               Associate 61%   (113) 39%  (72) 100%   (185)
##                 Manager 73%   (251) 27%  (93) 100%   (344)
##          Senior Manager 94%   (447)  6%  (29) 100%   (476)
##                Director 92%   (303)  8%  (28) 100%   (331)
##          Vice President 89%   (119) 11%  (15) 100%   (134)
##                   Total 84% (1,233) 16% (237) 100% (1,470)

Another finding supporting the same idea, that many single people make their work their entire personality, as they do not have much else in life to focus on. So this would come with using work to fill that hole.

##  marital_status/left_company          No       Yes        Total
##                       Single 74%   (350) 26% (120) 100%   (470)
##                      Married 88%   (589) 12%  (84) 100%   (673)
##                     Divorced 90%   (294) 10%  (33) 100%   (327)
##                        Total 84% (1,233) 16% (237) 100% (1,470)

An idea has formed, but before jumping to conclusions, let’s summarize and group our numerical values to see if there’s a story there

df %>% 
  group_by(left_company) %>% 
  summarise(Num_customers = n(),
            AvgSalary = mean(salary),
            AvgHours = mean(weekly_hours),
            AvgCompanies = mean(previous_companies))

## # A tibble: 2 × 5
##   left_company Num_customers AvgSalary AvgHours AvgCompanies
##   <fct>                <int>     <dbl>    <dbl>        <dbl>
## 1 No                    1233    97431.     48.4         2.97
## 2 Yes                    237    76626.     58.6         4.65

Here we see two key points: first, those who leave on average work 10 more hours than those who stay, indicating a 60-hour work week is nearly all 7 days spent working. Second, the number of previous companies an employee has worked for is also a factor. However, this does not mean previous companies and hours are correlated. Let us dive deeper.

In the below chart, we see our idea come to life. The average hours worked by departments are highest for salespeople who travel frequently.

df %>% 
  group_by(department, business_travel) %>% 
  summarise(AvgHours = mean(weekly_hours),)

## `summarise()` has grouped output by 'department'. You can override using the
## `.groups` argument.

## # A tibble: 18 × 3
## # Groups:   department [6]
##    department             business_travel AvgHours
##    <fct>                  <fct>              <dbl>
##  1 Marketing              None                49.4
##  2 Marketing              Rarely              49.9
##  3 Marketing              Frequently          50.9
##  4 Sales                  None                49.2
##  5 Sales                  Rarely              51.3
##  6 Sales                  Frequently          53.9
##  7 Research               None                49.0
##  8 Research               Rarely              48.4
##  9 Research               Frequently          48.7
## 10 Product Development    None                48.0
## 11 Product Development    Rarely              50.8
## 12 Product Development    Frequently          52.5
## 13 IT and Analytics       None                48.8
## 14 IT and Analytics       Rarely              49.3
## 15 IT and Analytics       Frequently          49.3
## 16 Finance and Operations None                48.9
## 17 Finance and Operations Rarely              52.0
## 18 Finance and Operations Frequently          53.2

Below, we have essentially converted the above table into a graph. The graph confirms our idea that long hours correlate with high turnover rates across all departments. It is not necessarily the sales department that indicates a high turnover rate; rather, it is the number of hours worked. All the blue points show us that those working 50 or more hours per week tend to leave.

ggplot(df, aes(x = salary, y = weekly_hours, color = left_company)) +
  geom_point() +
  facet_wrap(~ department)

Below, we see that every employee who leaves is working 50 hours or more per week across each department.

ggplot(df, aes(x= department, y = weekly_hours)) +
  geom_boxplot() +
  facet_wrap(~ left_company)

Here we see that the proportion of employees who leave is 30% across the three major departments with the longest average working hours.

ggplot(df, aes(x = department, fill = left_company)) +
  geom_bar(position = "fill") +
  labs(title = "Turnover Rate by Department", y = "Proportion", x = "Department")

Employers often claim that finances are the biggest factor in employee turnover. However, here we see that this idea is false. Employees making as much as $250,000 annually are still leaving due to long working hours.

ggplot(df, aes(x = weekly_hours, y = salary, color = left_company)) +
  geom_point() +
  facet_wrap(~left_company)

Below we will build a machine learning algorithm using the random forest model.

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

## ✔ broom        1.0.5      ✔ rsample      1.2.0 
## ✔ dials        1.2.1      ✔ tune         1.2.0 
## ✔ infer        1.0.6      ✔ workflows    1.1.4 
## ✔ modeldata    1.3.0      ✔ workflowsets 1.0.1 
## ✔ parsnip      1.2.0      ✔ yardstick    1.3.0 
## ✔ recipes      1.0.10

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard()    masks purrr::discard()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ recipes::fixed()     masks stringr::fixed()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ yardstick::spec()    masks readr::spec()
## ✖ Hmisc::src()         masks dplyr::src()
## ✖ recipes::step()      masks stats::step()
## ✖ Hmisc::summarize()   masks dplyr::summarize()
## ✖ parsnip::translate() masks Hmisc::translate()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org

library(vip)

## 
## Attaching package: 'vip'

## The following object is masked from 'package:utils':
## 
##     vi

library(rpart.plot)

## Loading required package: rpart

## 
## Attaching package: 'rpart'

## The following object is masked from 'package:dials':
## 
##     prune

library(ranger)
library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ranger':
## 
##     importance

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

set.seed(314)
df_split <- initial_split(df, prop = .8,
                          strata = left_company)
df_training <- df_split %>% 
  training()

df_testing <- df_split %>% 
  testing()

df_folds <- vfold_cv(df_training, v = 5)

df_recipe <- recipe(left_company ~ ., data = df_training) %>% 
                       step_YeoJohnson(all_numeric(), -all_outcomes()) %>% 
                       step_normalize(all_numeric(), -all_outcomes()) %>% 
                       step_dummy(all_nominal(), -all_outcomes())
df_recipe %>% 
  prep() %>% 
  bake(new_data = df_training)

## # A tibble: 1,175 × 27
##      salary weekly_hours yrs_at_company yrs_since_promotion previous_companies
##       <dbl>        <dbl>          <dbl>               <dbl>              <dbl>
##  1  0.720          0.186         0.485                0.929            -1.92  
##  2 -0.142         -0.290         0.322                0.929            -1.92  
##  3  0.00246       -1.49         -1.46                -1.09              0.0651
##  4  0.832         -0.556        -1.46                -1.09              0.0651
##  5  0.168          0.186         0.632                0.123             0.0651
##  6  0.844         -0.844        -0.0670              -1.09             -1.92  
##  7 -0.579         -0.290         0.632               -1.09              0.0651
##  8 -1.70          -0.290        -0.0670               1.13             -0.745 
##  9  0.723         -0.290        -0.963                0.123            -0.745 
## 10 -0.423         -0.556         0.766                1.54             -0.745 
## # ℹ 1,165 more rows
## # ℹ 22 more variables: miles_from_home <dbl>, left_company <fct>,
## #   department_Sales <dbl>, department_Research <dbl>,
## #   department_Product.Development <dbl>, department_IT.and.Analytics <dbl>,
## #   department_Finance.and.Operations <dbl>, job_level_Manager <dbl>,
## #   job_level_Senior.Manager <dbl>, job_level_Director <dbl>,
## #   job_level_Vice.President <dbl>, business_travel_Rarely <dbl>, …

rf_model <- rand_forest(mtry = tune(),
                        trees = tune(),
                        min_n = tune()) %>% 
            set_engine('ranger', importance = "impurity") %>% 
            set_mode('classification')
rf_model

## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = tune()
##   min_n = tune()
## 
## Engine-Specific Arguments:
##   importance = impurity
## 
## Computational engine: ranger

rf_workflow <- workflow() %>% 
               add_model(rf_model) %>% 
               add_recipe(df_recipe)
rf_workflow

## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_YeoJohnson()
## • step_normalize()
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = tune()
##   min_n = tune()
## 
## Engine-Specific Arguments:
##   importance = impurity
## 
## Computational engine: ranger

set.seed(314)

rf_grid <- grid_random(mtry() %>% range_set(c(4, 12)),
                       trees(),
                       min_n(),
                       size = 10)

rf_tuning <- rf_workflow %>% 
             tune_grid(resamples = df_folds,
                       grid = rf_grid)
rf_tuning

## # Tuning results
## # 5-fold cross-validation 
## # A tibble: 5 × 4
##   splits            id    .metrics          .notes          
##   <list>            <chr> <list>            <list>          
## 1 <split [940/235]> Fold1 <tibble [30 × 7]> <tibble [0 × 3]>
## 2 <split [940/235]> Fold2 <tibble [30 × 7]> <tibble [0 × 3]>
## 3 <split [940/235]> Fold3 <tibble [30 × 7]> <tibble [0 × 3]>
## 4 <split [940/235]> Fold4 <tibble [30 × 7]> <tibble [0 × 3]>
## 5 <split [940/235]> Fold5 <tibble [30 × 7]> <tibble [0 × 3]>

rf_tuning %>% show_best(metric = 'roc_auc')

## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     5   678    16 roc_auc binary     0.996     5 0.00137 Preprocessor1_Model05
## 2     6  1235     6 roc_auc binary     0.996     5 0.00132 Preprocessor1_Model03
## 3     8  1387    10 roc_auc binary     0.996     5 0.00121 Preprocessor1_Model10
## 4     9   138    14 roc_auc binary     0.996     5 0.00133 Preprocessor1_Model06
## 5     6  1841    24 roc_auc binary     0.996     5 0.00128 Preprocessor1_Model01

best_rf <- rf_tuning %>% 
           select_best(metric = 'roc_auc')
best_rf

## # A tibble: 1 × 4
##    mtry trees min_n .config              
##   <int> <int> <int> <chr>                
## 1     5   678    16 Preprocessor1_Model05

final_rf_workflow <- rf_workflow %>% 
                     finalize_workflow(best_rf)
final_rf_workflow

## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
## 
## • step_YeoJohnson()
## • step_normalize()
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 5
##   trees = 678
##   min_n = 16
## 
## Engine-Specific Arguments:
##   importance = impurity
## 
## Computational engine: ranger

rf_wf_fit <- final_rf_workflow %>% 
             fit(data = df_training)
rf_fit <- rf_wf_fit %>% 
          pull_workflow_fit()

## Warning: `pull_workflow_fit()` was deprecated in workflows 0.2.3.
## ℹ Please use `extract_fit_parsnip()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

rf_fit

## parsnip model object
## 
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~5L,      x), num.trees = ~678L, min.node.size = min_rows(~16L, x),      importance = ~"impurity", num.threads = 1, verbose = FALSE,      seed = sample.int(10^5, 1), probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  678 
## Sample size:                      1175 
## Number of independent variables:  26 
## Mtry:                             5 
## Target node size:                 16 
## Variable importance mode:         impurity 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.02159441

Ultimalty we are correct as we see here the highest indicator for a employee leaving is weekly hours

vip(rf_fit)

rf_last_fit <- final_rf_workflow %>% 
               last_fit(df_split)
rf_predictions <- rf_last_fit %>% 
  collect_predictions() 
rf_predictions

## # A tibble: 295 × 7
##    .pred_class .pred_No .pred_Yes id                .row left_company .config   
##    <fct>          <dbl>     <dbl> <chr>            <int> <fct>        <chr>     
##  1 No             0.917   0.0834  train/test split     2 No           Preproces…
##  2 No             0.947   0.0533  train/test split     5 No           Preproces…
##  3 No             0.983   0.0167  train/test split    10 No           Preproces…
##  4 No             0.999   0.00120 train/test split    29 No           Preproces…
##  5 No             0.693   0.307   train/test split    38 No           Preproces…
##  6 No             0.901   0.0989  train/test split    40 No           Preproces…
##  7 No             0.998   0.00207 train/test split    53 No           Preproces…
##  8 No             0.995   0.00548 train/test split    54 No           Preproces…
##  9 No             0.875   0.125   train/test split    56 No           Preproces…
## 10 No             0.998   0.00164 train/test split    57 No           Preproces…
## # ℹ 285 more rows

rf_last_fit %>% 
  collect_metrics()

## # A tibble: 3 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary        0.973  Preprocessor1_Model1
## 2 roc_auc     binary        0.996  Preprocessor1_Model1
## 3 brier_class binary        0.0230 Preprocessor1_Model1

rf_predictions %>% 
                roc_curve(truth  = left_company, .pred_Yes) %>% 
                autoplot()

Creating a conf matric to see how many it gets correct we see that this model has a accruacy rate of 98% effectivly predicting pssible employee churm 98% of the time.

conf_mat(rf_predictions, truth = left_company, estimate = .pred_class)

##           Truth
## Prediction  No Yes
##        No  243   4
##        Yes   4  44

Employee Churn

Daniyal Zafar

2024-06-26

Business Problem

End Goal

Dataset Analysis

Descriptive Statistics

Reccomendation