Data Challenge 3: Employee Retention

Executive Summary

The factors that most strongly drive employee churn are salary and years experience upon hire. Generally, employees who make less money and who have less experience upon being hired are more likely to quit. Suggestions for increasing employee retention are:

Identify at-risk employees. Employees that are more junior when hired and employees earning a relatively low salary are more likely to quit.
Increase salaries wherever possible. Revisit salaries often to ensure they remain competitive.
You can expect that employees entering with less experience are more likely to quit. Especially increase salaries for less experienced employees who you want to keep
There is some evidence that employees are more likely to leave after working with a company for 1 year or 2 years. Provide incentives to entice employees to stay through these benchmarks.
Other factors not included in this analysis likely play a strong role in employee churn. If possible, collect data on employee satisfaction to enable better predictions for retention

This report is available at: http://rpubs.com/shannon_carter/542995

Prep Workspace

## Set working directory and clear memory
setwd("/Users/shannoncarter/Documents/Insight/DataChallenges/DC3")
rm(list = ls(all = T))

## Load required packages
library(tidyverse)    # for data manipulation and plotting
library(lubridate)    # for handling dates
library(rattle)       # graphical interface for data science in R
library(rpart.plot)   # decision tree model and plot
library(caret)        # data viz, specifically for ML
library(randomForest)
library(e1071)

## Load universal plotting elements
mytheme <- theme_bw(base_size = 15) +
  theme(legend.text = element_text(size = 10),
        legend.title = element_text(size = 11),
        text = element_text(size = 14),     
        axis.title = element_text(size = 12),
        axis.text  = element_text(size = 10),
        panel.grid = element_blank())

Load Data

Read in data (.csv format), and look at it’s head and structure.

# read in data 
df = read.csv("employee_retention.csv")

# check out the first few lines of data
head(df)

##   X employee_id company_id            dept seniority salary join_date
## 1 0     1001444          8 temp_contractor         0   5850   1/26/08
## 2 1      388804          8          design        21 191000   5/17/11
## 3 2      407990          3          design         9  90000   3/26/12
## 4 3      120657          2        engineer        20 298000    4/8/13
## 5 4     1006393          1 temp_contractor         0   8509   7/20/08
## 6 5      287530          5       marketing        20 180000   6/30/14
##   quit_date
## 1   4/25/08
## 2   3/16/12
## 3   4/10/15
## 4   1/30/15
## 5  10/18/08
## 6

# look at data types
str(df)

## 'data.frame':    34702 obs. of  8 variables:
##  $ X          : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ employee_id: int  1001444 388804 407990 120657 1006393 287530 561043 702479 545690 622587 ...
##  $ company_id : int  8 8 3 2 1 5 3 7 10 5 ...
##  $ dept       : Factor w/ 7 levels "customer_service",..: 7 3 3 4 7 5 1 2 2 6 ...
##  $ seniority  : int  0 21 9 20 0 20 18 7 16 28 ...
##  $ salary     : int  5850 191000 90000 298000 8509 180000 119000 140000 238000 166000 ...
##  $ join_date  : Factor w/ 2339 levels "1/1/07","1/1/08",..: 111 1414 1080 1344 1830 1701 1825 704 680 1750 ...
##  $ quit_date  : Factor w/ 2009 levels "","1/1/08","1/1/09",..: 1105 893 1023 140 242 1 965 1814 1 1 ...

Clean Data

Looking at the raw data, it’s clear we have to do some basic clean-up. Things I can see immediately:

remove X index row
make company_id a factor variable
remove error entries where ‘seniority’ is 100
add a binary column for quit? 0/1 = quit/didn’t
fill in blank cells for quit_date with “NA”
make date columns as type “date”
add variable ‘time with company’

More things will likely arise, but let’s first tackle these issues.

# 1. drop the index column
df <- select(df, -X)

# 2. make company_id a factor. even though it's a number, it's really a categorical variable
df$company_id <- as.factor(df$company_id)

# 3. set seniority to NA if currently it seems wrong
df$seniority <- na_if(df$seniority, 99)
df$seniority <- na_if(df$seniority, 98)

# 4. add a binary column for quit
df$quit <- ifelse(df$quit_date == "", 0, 1)

# 5. fill in blank cells for quit date with "NA"
df$quit_date <- na_if(df$quit_date, "")

# 6. make date columns as type 'date'
df$join_date <- mdy(df$join_date)
df$quit_date <- mdy(df$quit_date)

# 7. add variable time with company
df$time_with_company <- as.double(difftime(df$quit_date, df$join_date, units = "days"))

Now, check the data again. Looks good!

head(df)

##   employee_id company_id            dept seniority salary  join_date
## 1     1001444          8 temp_contractor         0   5850 2008-01-26
## 2      388804          8          design        21 191000 2011-05-17
## 3      407990          3          design         9  90000 2012-03-26
## 4      120657          2        engineer        20 298000 2013-04-08
## 5     1006393          1 temp_contractor         0   8509 2008-07-20
## 6      287530          5       marketing        20 180000 2014-06-30
##    quit_date quit time_with_company
## 1 2008-04-25    1                90
## 2 2012-03-16    1               304
## 3 2015-04-10    1              1110
## 4 2015-01-30    1               662
## 5 2008-10-18    1                90
## 6       <NA>    0                NA

str(df)

## 'data.frame':    34702 obs. of  9 variables:
##  $ employee_id      : int  1001444 388804 407990 120657 1006393 287530 561043 702479 545690 622587 ...
##  $ company_id       : Factor w/ 12 levels "1","2","3","4",..: 8 8 3 2 1 5 3 7 10 5 ...
##  $ dept             : Factor w/ 7 levels "customer_service",..: 7 3 3 4 7 5 1 2 2 6 ...
##  $ seniority        : int  0 21 9 20 0 20 18 7 16 28 ...
##  $ salary           : int  5850 191000 90000 298000 8509 180000 119000 140000 238000 166000 ...
##  $ join_date        : Date, format: "2008-01-26" "2011-05-17" ...
##  $ quit_date        : Date, format: "2008-04-25" "2012-03-16" ...
##  $ quit             : num  1 1 1 1 1 0 1 1 0 0 ...
##  $ time_with_company: num  90 304 1110 662 90 NA 634 612 NA NA ...

Inspect data

Calculate some important metrics, for reference

# how many employee records do we have?
length(df$quit)

## [1] 34702

# calculate the proportion of employees in our dataset who have quit
sum(df$quit)/length(df$quit)

## [1] 0.6774826

summary_stats <- df %>%
  group_by(quit) %>% 
  summarize(mean_salary = mean(salary, na.rm = T),
            mean_twc = mean(time_with_company, na.rm = T),
            mean_seniority = mean(seniority, na.rm = T))
summary_stats

## # A tibble: 2 x 4
##    quit mean_salary mean_twc mean_seniority
##   <dbl>       <dbl>    <dbl>          <dbl>
## 1     0     141192.     NaN           14.1 
## 2     1      80884.     391.           8.11

Above, we see that 67% of the 34,702 employees in our dataset have quit the company. Employees who quit got paid a lot less than those who stayed. They also had less experience prior to starting the job.

Diagnostic plots

Make some diagonsitic plots to get familiar with the data and check the data for outliers or other weirdness.

Salaries by company

ggplot(df, aes(x = company_id, y = salary)) + mytheme +
  geom_boxplot()

Salary-seniority trend, by company

This is pretty neat– generally, people with more experience are paid more. This often follows a stepwise pattern, with pay bumps associated with 5 year increments of experience.

There are very few records for companies 11 and 12. I’ll have to keep this in mind when making claims or predictions at the company level.

ggplot(df, aes(x = seniority, y = salary)) + mytheme +
  geom_point(alpha = 0.2) + geom_smooth() +
  facet_wrap(~company_id, scales = 'free')

Employee lifetime by company

Some companies (1, 2, 3, and 4) seem to have better retention. Maybe also 11 and 12, but hesitant to say because of so few observations.

ggplot(df, aes(x = company_id, y = time_with_company)) + mytheme +
  geom_boxplot()

Employee lifetime by department

Temp contractors by definition will have a very short lifetime with any company. Otherwise, employee lifetime doesn’t vary a whole lot across departments.

ggplot(df, aes(x = dept, y = time_with_company)) + mytheme +
  geom_boxplot()

Employee lifetime by salary

I’d expect there to be more of a correlation here… but one interesting takeaway from this plot is that employees seem likely to leave after about 1 year or 2 years with a company.

To note, I removed temporary contractors in this plot, since their “time with company” is not really comparable with the other departments.

The trends by department aren’t super interpretable here, but we definitely can see some separation in salary on the basis of department.

ggplot(subset(df, subset = (dept != "temp_contractor")), aes(x = salary, y = time_with_company, color = dept)) + mytheme +
  geom_point(alpha = 0.25) +
  geom_hline(aes(yintercept = 365), linetype = "dashed", color = "blue", size = 1) +
  geom_hline(aes(yintercept = 365*2), linetype = "dashed", color = "blue", size = 1)

Salary by department

Just because I am interested.

ggplot(df, aes(x = dept, y = salary)) + mytheme +
  geom_boxplot()

Turnover by department

By definition, temporary contractors quit, always. Starting to think I should remove them from this dataset… Otherwise, we don’t see any real trends for turnover frequency by department. For most departments, there’s a pretty even balance between employees retained and employees left.

ggplot(df, aes(x = dept, fill = as.factor(quit))) + mytheme +
  geom_bar(position=position_dodge())

### Turnover frequency by company

No obvious trends in turnover by company. Removed temporary contractors in this plot. In doing so, we see that we have more records for current employees than those who have quit.

ggplot(subset(df, subset = (dept != "temp_contractor")), aes(x = company_id, fill = as.factor(quit))) + mytheme +
  geom_bar(position=position_dodge())

Modelling

First, pull out the features we’ll use and divide data into train and test sets.

df$quit <- as.factor(df$quit)

# select variables to be included in the model
features <- c("company_id", "dept", "seniority", "salary", "quit")

## partition data into training and test sets
# 80/20 train/test
smp_size <- floor(0.80 * nrow(df))

## set the seed to make partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(df)), size = smp_size)

df_train <- df[train_ind, ]
df_test <- df[-train_ind, ]

Decision Tree

This decision tree tells us that the most important factors leading to an employee quitting are salary and seniority. Employees making over $247k are very unlikely to quit

rpart_model <- rpart(quit ~.,
                     data = df_train[features],
                     method = 'class',
                     parms = list(split='information'),
                     control = rpart.control(usesurrogate = 0,
                                             maxsurrogate = 0))
# Plot the decision tree
rpart.plot(rpart_model, roundint = FALSE, type = 3)

Random forest model

Salary, seniority, and department are important. This corroborates what we see with the decision tree and other plots above.

set.seed(222)
df_RF <- randomForest(quit~ .,
                           data = df_train[features],
                           ntree=500, importance = TRUE,
                           na.action = na.omit)
varImpPlot(df_RF,type=1,
           main="Variable Importance (Accuracy)",
           sub = "Random Forest Model")

var_importance <-importance(df_RF)
df_RF

## 
## Call:
##  randomForest(formula = quit ~ ., data = df_train[features], ntree = 500,      importance = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 34.33%
## Confusion matrix:
##      0     1 class.error
## 0 3454  5442   0.6117356
## 1 4022 14646   0.2154489

Model Metrics

67% accuracy. Given the few features, this is pretty ok! We’re doing better than chance at identifying employees at risk of quitting. Recall = 79%, meaning 79% of employees who resigned were correctly predicted by the model. Precision = 74%, meaning 74% of those identified as ‘at risk’ actually quit.

# generate predictions based on test data ("emp_test")
df_RF_pred <- predict(df_RF, newdata = df_test)
cm <- confusionMatrix(data = df_RF_pred,
                reference = df_test$quit,
                positive = "1", mode = "prec_recall")
cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0  902  977
##          1 1286 3732
##                                          
##                Accuracy : 0.6719         
##                  95% CI : (0.6607, 0.683)
##     No Information Rate : 0.6828         
##     P-Value [Acc > NIR] : 0.9744         
##                                          
##                   Kappa : 0.2128         
##                                          
##  Mcnemar's Test P-Value : 9.511e-11      
##                                          
##               Precision : 0.7437         
##                  Recall : 0.7925         
##                      F1 : 0.7673         
##              Prevalence : 0.6828         
##          Detection Rate : 0.5411         
##    Detection Prevalence : 0.7276         
##       Balanced Accuracy : 0.6024         
##                                          
##        'Positive' Class : 1              
##

Confusion Matrix

fourfoldplot(cm$table)