The factors that most strongly drive employee churn are salary and years experience upon hire. Generally, employees who make less money and who have less experience upon being hired are more likely to quit. Suggestions for increasing employee retention are:
This report is available at: http://rpubs.com/shannon_carter/542995
## Set working directory and clear memory
setwd("/Users/shannoncarter/Documents/Insight/DataChallenges/DC3")
rm(list = ls(all = T))
## Load required packages
library(tidyverse) # for data manipulation and plotting
library(lubridate) # for handling dates
library(rattle) # graphical interface for data science in R
library(rpart.plot) # decision tree model and plot
library(caret) # data viz, specifically for ML
library(randomForest)
library(e1071)
## Load universal plotting elements
mytheme <- theme_bw(base_size = 15) +
theme(legend.text = element_text(size = 10),
legend.title = element_text(size = 11),
text = element_text(size = 14),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
panel.grid = element_blank())
Read in data (.csv format), and look at it’s head and structure.
# read in data
df = read.csv("employee_retention.csv")
# check out the first few lines of data
head(df)
## X employee_id company_id dept seniority salary join_date
## 1 0 1001444 8 temp_contractor 0 5850 1/26/08
## 2 1 388804 8 design 21 191000 5/17/11
## 3 2 407990 3 design 9 90000 3/26/12
## 4 3 120657 2 engineer 20 298000 4/8/13
## 5 4 1006393 1 temp_contractor 0 8509 7/20/08
## 6 5 287530 5 marketing 20 180000 6/30/14
## quit_date
## 1 4/25/08
## 2 3/16/12
## 3 4/10/15
## 4 1/30/15
## 5 10/18/08
## 6
## 'data.frame': 34702 obs. of 8 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ employee_id: int 1001444 388804 407990 120657 1006393 287530 561043 702479 545690 622587 ...
## $ company_id : int 8 8 3 2 1 5 3 7 10 5 ...
## $ dept : Factor w/ 7 levels "customer_service",..: 7 3 3 4 7 5 1 2 2 6 ...
## $ seniority : int 0 21 9 20 0 20 18 7 16 28 ...
## $ salary : int 5850 191000 90000 298000 8509 180000 119000 140000 238000 166000 ...
## $ join_date : Factor w/ 2339 levels "1/1/07","1/1/08",..: 111 1414 1080 1344 1830 1701 1825 704 680 1750 ...
## $ quit_date : Factor w/ 2009 levels "","1/1/08","1/1/09",..: 1105 893 1023 140 242 1 965 1814 1 1 ...
Looking at the raw data, it’s clear we have to do some basic clean-up. Things I can see immediately:
More things will likely arise, but let’s first tackle these issues.
# 1. drop the index column
df <- select(df, -X)
# 2. make company_id a factor. even though it's a number, it's really a categorical variable
df$company_id <- as.factor(df$company_id)
# 3. set seniority to NA if currently it seems wrong
df$seniority <- na_if(df$seniority, 99)
df$seniority <- na_if(df$seniority, 98)
# 4. add a binary column for quit
df$quit <- ifelse(df$quit_date == "", 0, 1)
# 5. fill in blank cells for quit date with "NA"
df$quit_date <- na_if(df$quit_date, "")
# 6. make date columns as type 'date'
df$join_date <- mdy(df$join_date)
df$quit_date <- mdy(df$quit_date)
# 7. add variable time with company
df$time_with_company <- as.double(difftime(df$quit_date, df$join_date, units = "days"))
Now, check the data again. Looks good!
## employee_id company_id dept seniority salary join_date
## 1 1001444 8 temp_contractor 0 5850 2008-01-26
## 2 388804 8 design 21 191000 2011-05-17
## 3 407990 3 design 9 90000 2012-03-26
## 4 120657 2 engineer 20 298000 2013-04-08
## 5 1006393 1 temp_contractor 0 8509 2008-07-20
## 6 287530 5 marketing 20 180000 2014-06-30
## quit_date quit time_with_company
## 1 2008-04-25 1 90
## 2 2012-03-16 1 304
## 3 2015-04-10 1 1110
## 4 2015-01-30 1 662
## 5 2008-10-18 1 90
## 6 <NA> 0 NA
## 'data.frame': 34702 obs. of 9 variables:
## $ employee_id : int 1001444 388804 407990 120657 1006393 287530 561043 702479 545690 622587 ...
## $ company_id : Factor w/ 12 levels "1","2","3","4",..: 8 8 3 2 1 5 3 7 10 5 ...
## $ dept : Factor w/ 7 levels "customer_service",..: 7 3 3 4 7 5 1 2 2 6 ...
## $ seniority : int 0 21 9 20 0 20 18 7 16 28 ...
## $ salary : int 5850 191000 90000 298000 8509 180000 119000 140000 238000 166000 ...
## $ join_date : Date, format: "2008-01-26" "2011-05-17" ...
## $ quit_date : Date, format: "2008-04-25" "2012-03-16" ...
## $ quit : num 1 1 1 1 1 0 1 1 0 0 ...
## $ time_with_company: num 90 304 1110 662 90 NA 634 612 NA NA ...
Calculate some important metrics, for reference
## [1] 34702
## [1] 0.6774826
summary_stats <- df %>%
group_by(quit) %>%
summarize(mean_salary = mean(salary, na.rm = T),
mean_twc = mean(time_with_company, na.rm = T),
mean_seniority = mean(seniority, na.rm = T))
summary_stats
## # A tibble: 2 x 4
## quit mean_salary mean_twc mean_seniority
## <dbl> <dbl> <dbl> <dbl>
## 1 0 141192. NaN 14.1
## 2 1 80884. 391. 8.11
Above, we see that 67% of the 34,702 employees in our dataset have quit the company. Employees who quit got paid a lot less than those who stayed. They also had less experience prior to starting the job.
Make some diagonsitic plots to get familiar with the data and check the data for outliers or other weirdness.
This is pretty neat– generally, people with more experience are paid more. This often follows a stepwise pattern, with pay bumps associated with 5 year increments of experience.
There are very few records for companies 11 and 12. I’ll have to keep this in mind when making claims or predictions at the company level.
ggplot(df, aes(x = seniority, y = salary)) + mytheme +
geom_point(alpha = 0.2) + geom_smooth() +
facet_wrap(~company_id, scales = 'free')
Some companies (1, 2, 3, and 4) seem to have better retention. Maybe also 11 and 12, but hesitant to say because of so few observations.
Temp contractors by definition will have a very short lifetime with any company. Otherwise, employee lifetime doesn’t vary a whole lot across departments.
I’d expect there to be more of a correlation here… but one interesting takeaway from this plot is that employees seem likely to leave after about 1 year or 2 years with a company.
To note, I removed temporary contractors in this plot, since their “time with company” is not really comparable with the other departments.
The trends by department aren’t super interpretable here, but we definitely can see some separation in salary on the basis of department.
ggplot(subset(df, subset = (dept != "temp_contractor")), aes(x = salary, y = time_with_company, color = dept)) + mytheme +
geom_point(alpha = 0.25) +
geom_hline(aes(yintercept = 365), linetype = "dashed", color = "blue", size = 1) +
geom_hline(aes(yintercept = 365*2), linetype = "dashed", color = "blue", size = 1)
Just because I am interested.
By definition, temporary contractors quit, always. Starting to think I should remove them from this dataset… Otherwise, we don’t see any real trends for turnover frequency by department. For most departments, there’s a pretty even balance between employees retained and employees left.
### Turnover frequency by company
No obvious trends in turnover by company. Removed temporary contractors in this plot. In doing so, we see that we have more records for current employees than those who have quit.
ggplot(subset(df, subset = (dept != "temp_contractor")), aes(x = company_id, fill = as.factor(quit))) + mytheme +
geom_bar(position=position_dodge())
First, pull out the features we’ll use and divide data into train and test sets.
df$quit <- as.factor(df$quit)
# select variables to be included in the model
features <- c("company_id", "dept", "seniority", "salary", "quit")
## partition data into training and test sets
# 80/20 train/test
smp_size <- floor(0.80 * nrow(df))
## set the seed to make partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(df)), size = smp_size)
df_train <- df[train_ind, ]
df_test <- df[-train_ind, ]
This decision tree tells us that the most important factors leading to an employee quitting are salary and seniority. Employees making over $247k are very unlikely to quit
rpart_model <- rpart(quit ~.,
data = df_train[features],
method = 'class',
parms = list(split='information'),
control = rpart.control(usesurrogate = 0,
maxsurrogate = 0))
# Plot the decision tree
rpart.plot(rpart_model, roundint = FALSE, type = 3)
Salary, seniority, and department are important. This corroborates what we see with the decision tree and other plots above.
set.seed(222)
df_RF <- randomForest(quit~ .,
data = df_train[features],
ntree=500, importance = TRUE,
na.action = na.omit)
varImpPlot(df_RF,type=1,
main="Variable Importance (Accuracy)",
sub = "Random Forest Model")
##
## Call:
## randomForest(formula = quit ~ ., data = df_train[features], ntree = 500, importance = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 34.33%
## Confusion matrix:
## 0 1 class.error
## 0 3454 5442 0.6117356
## 1 4022 14646 0.2154489
67% accuracy. Given the few features, this is pretty ok! We’re doing better than chance at identifying employees at risk of quitting. Recall = 79%, meaning 79% of employees who resigned were correctly predicted by the model. Precision = 74%, meaning 74% of those identified as ‘at risk’ actually quit.
# generate predictions based on test data ("emp_test")
df_RF_pred <- predict(df_RF, newdata = df_test)
cm <- confusionMatrix(data = df_RF_pred,
reference = df_test$quit,
positive = "1", mode = "prec_recall")
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 902 977
## 1 1286 3732
##
## Accuracy : 0.6719
## 95% CI : (0.6607, 0.683)
## No Information Rate : 0.6828
## P-Value [Acc > NIR] : 0.9744
##
## Kappa : 0.2128
##
## Mcnemar's Test P-Value : 9.511e-11
##
## Precision : 0.7437
## Recall : 0.7925
## F1 : 0.7673
## Prevalence : 0.6828
## Detection Rate : 0.5411
## Detection Prevalence : 0.7276
## Balanced Accuracy : 0.6024
##
## 'Positive' Class : 1
##