1 Workforce Analysis

This data comes from a closed Kaggle competition and was downloaded through a kernel published for that competition. The data simulates a survey completed by the employees of a large company. The purpose of this analysis is to determine the primary reasons why high-value employees leave the company and to predict which of our high-value employees will leave next. Since the data provides no information about the company itself, let's assume it is a company in the insurance industry. Because the dataset represents the entire population of this company's employees, sampling is not necessary: the statistics calculated here are the true parameters of this population.

1.1 Exploratory Analysis

We can begin this analysis by exploring the data collected and, where possible, comparing these statistics to industry-wide metrics.
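The glimpse below comes from dplyr's glimpse() function. As a minimal sketch of the setup, assuming the competition CSV was saved locally as HR_comma_sep.csv (the file name is my assumption):

# Load packages and read the data; the file name is assumed, adjust to your copy
library(dplyr)
hr <- read.csv("HR_comma_sep.csv", stringsAsFactors = FALSE)
glimpse(hr)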

## Observations: 14,999
## Variables: 10
## $ satisfaction_level    <dbl> 0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10...
## $ last_evaluation       <dbl> 0.53, 0.86, 0.88, 0.87, 0.52, 0.50, 0.77...
## $ number_project        <int> 2, 5, 7, 5, 2, 2, 6, 5, 5, 2, 2, 6, 4, 2...
## $ average_montly_hours  <int> 157, 262, 272, 223, 159, 153, 247, 259, ...
## $ time_spend_company    <int> 3, 6, 4, 5, 3, 3, 4, 5, 5, 3, 3, 4, 5, 3...
## $ Work_accident         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ left                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ promotion_last_5years <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ sales                 <chr> "sales", "sales", "sales", "sales", "sal...
## $ salary                <chr> "low", "medium", "medium", "low", "low",...

As you can see from the glimpse of the data above, nothing directly tells us why our valuable employees are leaving. We can, however, use this survey data, matched with anonymized personal information, to infer the likely reasons these valuable employees left. Let's now look at the summary statistics for each variable. There are no missing values, so we will not have to impute anything.
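The missing-value check and the summary below come from base R; a sketch, assuming the data frame is named hr as above:

sum(is.na(hr))   # 0, so nothing to impute
summary(hr)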

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 3.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.498     Mean   :0.1446   Mean   :0.2381  
##  3rd Qu.: 4.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##  promotion_last_5years    sales              salary         
##  Min.   :0.00000       Length:14999       Length:14999      
##  1st Qu.:0.00000       Class :character   Class :character  
##  Median :0.00000       Mode  :character   Mode  :character  
##  Mean   :0.02127                                            
##  3rd Qu.:0.00000                                            
##  Max.   :1.00000

We can learn quite a bit about the employees of this company just by examining some summary statistics for each variable. From the table above: the turnover rate is 23.8%, average satisfaction is around 61%, and the average performance score is around 71%. We also see that, on average, employees work on nearly 4 projects a year, log about 200 hours per month, and have a tenure of about 3.5 years.

You can see from the summary statistics that this workforce is relatively happy, even though the single largest group of employees reports satisfaction levels near zero. Tenure immediately stands out as a factor in whether a valuable employee elects to find another company: the vast majority of employees have been with the company for only about 3 years. According to the Bureau of Labor Statistics' 2016 Economic News Release, the average tenure for workers in the insurance industry was 4.7 years, significantly higher than the average tenure at this company.
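Below is a sketch of the hexbin plot discussed next, using ggplot2's geom_hex(); the bin count and styling are assumptions, not the original chart options:

# Satisfaction vs. last evaluation; geom_hex() requires the hexbin package
library(ggplot2)
ggplot(hr, aes(x = last_evaluation, y = satisfaction_level)) +
  geom_hex(bins = 30) +
  labs(x = "Last Evaluation", y = "Satisfaction Level")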

The hexbin plot above gives us a high-level view of the distribution of employees' satisfaction levels in relation to their last performance evaluation. As expected, there is some positive correlation; however, there is also a pocket of high-performing employees who very much dislike their job. We will focus on pinpointing the reasons these high performers left the company in the coming sections.

# Correlation matrix of the numeric survey variables
HR_correlation <- hr %>% select(satisfaction_level:promotion_last_5years)
M <- cor(HR_correlation)
corrplot(M, method = "circle")

The correlation plot above gives us a solid high-level view of the factors associated with an employee leaving the company. The strongest relationship with leaving is satisfaction level, which is not terribly surprising. Satisfaction also shows a minor negative correlation with the number of projects worked over the past year. Lastly, you can see the positive correlations among the number of projects worked, the average number of hours worked per month, and the last evaluation: the employees with the highest performance scores tend to work the most hours and the most projects throughout the year, and most did not get promoted in the past five years. Here it is important to remember that average tenure is quite low compared to industry standards. Below I have printed the correlation matrix for further detail.

##                       satisfaction_level last_evaluation number_project
## satisfaction_level            1.00000000     0.105021214   -0.142969586
## last_evaluation               0.10502121     1.000000000    0.349332589
## number_project               -0.14296959     0.349332589    1.000000000
## average_montly_hours         -0.02004811     0.339741800    0.417210634
## time_spend_company           -0.10086607     0.131590722    0.196785891
## Work_accident                 0.05869724    -0.007104289   -0.004740548
## left                         -0.38837498     0.006567120    0.023787185
## promotion_last_5years         0.02560519    -0.008683768   -0.006063958
##                       average_montly_hours time_spend_company
## satisfaction_level            -0.020048113       -0.100866073
## last_evaluation                0.339741800        0.131590722
## number_project                 0.417210634        0.196785891
## average_montly_hours           1.000000000        0.127754910
## time_spend_company             0.127754910        1.000000000
## Work_accident                 -0.010142888        0.002120418
## left                           0.071287179        0.144822175
## promotion_last_5years         -0.003544414        0.067432925
##                       Work_accident        left promotion_last_5years
## satisfaction_level      0.058697241 -0.38837498           0.025605186
## last_evaluation        -0.007104289  0.00656712          -0.008683768
## number_project         -0.004740548  0.02378719          -0.006063958
## average_montly_hours   -0.010142888  0.07128718          -0.003544414
## time_spend_company      0.002120418  0.14482217           0.067432925
## Work_accident           1.000000000 -0.15462163           0.039245435
## left                   -0.154621634  1.00000000          -0.061788107
## promotion_last_5years   0.039245435 -0.06178811           1.000000000

1.1.1 Why Our High-Value Employees Are Leaving

In the last section, we developed some theories as to why many of the high-performing employees at this company are leaving; specifically, that these high performers are leaving because they are not being promoted. Let's look at only the high-performing employees who left to see what we can learn from the data.

# Distributions for the employees who left
hr_hist <- hr %>% filter(left == 1)
par(mfrow = c(1, 3))
hist(hr_hist$satisfaction_level, col = "#3090C7", main = "Satisfaction level", xlab = "Satisfaction Level")
hist(hr_hist$last_evaluation, col = "#3090C7", main = "Last evaluation", xlab = "Last Evaluation")
hist(hr_hist$average_montly_hours, col = "#3090C7", main = "Average monthly hours", xlab = "Average Monthly Hours")

As the correlation matrix showed, satisfaction level and performance score are both related to whether or not an employee leaves. Let's now look at whether an employee had an injury at work and at the perceived level of compensation.

# Bar charts of work accidents and perceived salary level (grid.arrange is from gridExtra)
work_bar <- ggplot(hr, aes(x = Work_accident)) + geom_bar(fill = "blue")
salary_bar <- ggplot(hr, aes(x = salary)) + geom_bar(fill = "blue")
grid.arrange(work_bar, salary_bar)

Work accidents don't occur very frequently, so they are probably not a major factor in high-performing employees leaving the company. As for perceived salary, it is clear that most employees at the company believe their pay is low in comparison to others doing the same work. Out of the company's 14,999 employees, 3,571 left. Of those 3,571, 2,014 were high performers.
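The counts above can be reproduced along these lines; the high-performer cutoff of a last evaluation of at least 0.70 is my assumption, mirroring the filter used below:

sum(hr$left == 1)                                # 3,571 employees who left
sum(hr$left == 1 & hr$last_evaluation >= 0.70)   # high performers who left (cutoff assumed)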

par(mfrow = c(1, 1))
# Subset to "high value" employees: strong last evaluation, long tenure, or many projects
hr_good_leaving_people2 <- hr %>%
  filter(last_evaluation >= 0.70 | time_spend_company >= 4 | number_project > 5)
hr_good_people_select <- hr_good_leaving_people2 %>%
  select(satisfaction_level, number_project:promotion_last_5years)
M <- cor(hr_good_people_select)
corrplot(M, method = "circle")

We are starting to get a better picture of why these high performers are leaving. In the correlation plot above, restricted to high-value employees, you can see a fairly strong positive correlation between leaving and hours worked, and a clear negative correlation between hours worked and job satisfaction. Below is a table of summary statistics for just these employees.

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.090      Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.490      1st Qu.:0.7300   1st Qu.:3.000   1st Qu.:171.0       
##  Median :0.680      Median :0.8300   Median :4.000   Median :218.0       
##  Mean   :0.617      Mean   :0.8015   Mean   :4.159   Mean   :211.8       
##  3rd Qu.:0.830      3rd Qu.:0.9100   3rd Qu.:5.000   3rd Qu.:253.0       
##  Max.   :1.000      Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##  time_spend_company Work_accident         left       
##  Min.   : 2.000     Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 3.000     1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 4.000     Median :0.0000   Median :0.0000  
##  Mean   : 3.916     Mean   :0.1521   Mean   :0.2061  
##  3rd Qu.: 5.000     3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :10.000     Max.   :1.0000   Max.   :1.0000  
##  promotion_last_5years    sales              salary         
##  Min.   :0.00000       Length:9772        Length:9772       
##  1st Qu.:0.00000       Class :character   Class :character  
##  Median :0.00000       Mode  :character   Mode  :character  
##  Mean   :0.02384                                            
##  3rd Qu.:0.00000                                            
##  Max.   :1.00000

In general, the primary factors in a high-performing employee leaving are working longer hours, carrying more projects, and not receiving a promotion in the last five years. Let's look at the turnover rate by job function to see whether any particular function has an outsized effect on high-value employees leaving.
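Here is a sketch of the turnover-by-job-function plot referenced below; the original plotting code is not shown, so the chart style is an assumption:

# Turnover rate by job function (the "sales" column holds the job function)
hr %>%
  group_by(sales) %>%
  summarise(turnover_rate = mean(left)) %>%
  ggplot(aes(x = reorder(sales, -turnover_rate), y = turnover_rate)) +
  geom_col(fill = "blue") +
  labs(x = "Job Function", y = "Turnover Rate")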


We can see from the plot that there is not much difference among the turnover rates of the job functions. I expected sales or support to have a significantly higher turnover rate due to the nature of the work, but their turnover rates are not significantly different from those of the other job functions.

Over the course of this exploratory analysis, we have deduced that the major factor in a high-value employee's decision to leave is that high performers are not being rewarded for their exceptional productivity. While the number of high-performing employees will generally exceed the number of promotions or available growth opportunities, there appears to be an opportunity to implement an incentive or bonus program to reward high performers who have not received a promotion.

1.2 Predictive Analysis

The exploratory analysis in the last section showed us why high-value employees were leaving this company; the common theme is that they are not being rewarded for their strong performance. In this section, we look at predicting which high-value employees will leave next. I am going to use two different methods: the first is recursive partitioning (a decision tree), and the second is logistic regression. First we need to set up cross-validation parameters for our models. After we create our models based on those cross-validation parameters, we can split the data into training and test sets to execute the final model.
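The summary below describes the modeling subset, hr_model, which holds 7,595 high-value employees with left converted to a factor for classification. The exact filter that built hr_model is not shown here, so the snippet below only illustrates the conversion step:

hr_model$left <- as.factor(hr_model$left)   # classification target as a factor
summary(hr_model)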

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.3500     1st Qu.:0.7200   1st Qu.:3.000   1st Qu.:173.0       
##  Median :0.6600     Median :0.8700   Median :4.000   Median :222.0       
##  Mean   :0.5902     Mean   :0.8111   Mean   :4.294   Mean   :214.2       
##  3rd Qu.:0.8200     3rd Qu.:0.9300   3rd Qu.:5.000   3rd Qu.:255.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##  time_spend_company Work_accident    left     promotion_last_5years
##  Min.   : 2.000     Min.   :0.0000   0:5611   Min.   :0.00000      
##  1st Qu.: 3.000     1st Qu.:0.0000   1:1984   1st Qu.:0.00000      
##  Median : 4.000     Median :0.0000            Median :0.00000      
##  Mean   : 4.294     Mean   :0.1442            Mean   :0.02396      
##  3rd Qu.: 5.000     3rd Qu.:0.0000            3rd Qu.:0.00000      
##  Max.   :10.000     Max.   :1.0000            Max.   :1.00000      
##     sales              salary         
##  Length:7595        Length:7595       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
# 5-fold cross-validation, repeated 3 times (trainControl is from the caret package)
train_control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

1.2.1 Recursive Partitioning Decision Tree

Now that we've created our cross-validation control, we can fit a recursive partitioning model. This model is a decision tree that classifies observations by recursively splitting the data into sub-populations based on the features in the dataset.

# Fit the recursive partitioning (rpart) decision tree using the CV control
rpartmodel <- train(left ~ ., data = hr_model, trControl = train_control, method = "rpart")
predictions <- predict(rpartmodel, hr_model)
hr_model_rpart <- cbind(hr_model, predictions)
head(hr_model_rpart)
##   satisfaction_level last_evaluation number_project average_montly_hours
## 1               0.80            0.86              5                  262
## 2               0.11            0.88              7                  272
## 3               0.72            0.87              5                  223
## 4               0.10            0.77              6                  247
## 5               0.92            0.85              5                  259
## 6               0.89            1.00              5                  224
##   time_spend_company Work_accident left promotion_last_5years sales salary
## 1                  6             0    1                     0 sales medium
## 2                  4             0    1                     0 sales medium
## 3                  5             0    1                     0 sales    low
## 4                  4             0    1                     0 sales    low
## 5                  5             0    1                     0 sales    low
## 6                  5             0    1                     0 sales    low
##   predictions
## 1           1
## 2           1
## 3           1
## 4           1
## 5           1
## 6           1

Now that we’ve fit the model, we can look at a confusion matrix.
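The matrix below can be produced with caret's confusionMatrix(); a sketch, noting that the positive class defaults to the first factor level, 0:

confusionMatrix(hr_model_rpart$predictions, hr_model_rpart$left)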

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5444  246
##          1  167 1738
##                                           
##                Accuracy : 0.9456          
##                  95% CI : (0.9403, 0.9506)
##     No Information Rate : 0.7388          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8573          
##  Mcnemar's Test P-Value : 0.000124        
##                                           
##             Sensitivity : 0.9702          
##             Specificity : 0.8760          
##          Pos Pred Value : 0.9568          
##          Neg Pred Value : 0.9123          
##              Prevalence : 0.7388          
##          Detection Rate : 0.7168          
##    Detection Prevalence : 0.7492          
##       Balanced Accuracy : 0.9231          
##                                           
##        'Positive' Class : 0               
## 

The confusion matrix provides us with some diagnostic statistics for the accuracy of the model. The model seems to fit quite well, with 94.6 percent accuracy. The sensitivity, or the model's ability to detect true positives, is strong at 97 percent, and the specificity, or its ability to detect true negatives, is also fairly strong at 87.6 percent. Before we move on to the logistic regression, we should plot the true positive rate against the false positive rate (an ROC curve) to make sure we have a proper threshold for deciding who will actually leave.
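Here is a sketch of the ROC curve discussed below, using the pROC package; the original plotting code is not shown, so this is an assumption:

# Class probabilities from the tree, then the ROC curve and its AUC
library(pROC)
tree_probs <- predict(rpartmodel, hr_model, type = "prob")[, "1"]
tree_roc <- roc(response = hr_model$left, predictor = tree_probs)
plot(tree_roc)
auc(tree_roc)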

As you can see from the ROC plot, the area under the curve appears to be close to its maximum. Let's move on to the logistic regression to see how that model compares.

1.2.2 Logistic Regression

While the decision tree model fit fairly well, let's also fit a boosted logistic regression model (caret's LogitBoost method) to the data.

# Boosted logistic regression via caret's "LogitBoost" method
glm_model <- train(left ~ ., data = hr_model, trControl = train_control, method = "LogitBoost")
predictions <- predict(glm_model, hr_model)
glm_model_complete <- cbind(hr_model, predictions)

Now that we've fit the model, let's take a look at its confusion matrix.
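As before, the matrix below can be produced with caret's confusionMatrix(); a sketch:

confusionMatrix(glm_model_complete$predictions, glm_model_complete$left)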

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5553  259
##          1   58 1725
##                                           
##                Accuracy : 0.9583          
##                  95% CI : (0.9535, 0.9626)
##     No Information Rate : 0.7388          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8882          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9897          
##             Specificity : 0.8695          
##          Pos Pred Value : 0.9554          
##          Neg Pred Value : 0.9675          
##              Prevalence : 0.7388          
##          Detection Rate : 0.7311          
##    Detection Prevalence : 0.7652          
##       Balanced Accuracy : 0.9296          
##                                           
##        'Positive' Class : 0               
## 

As you can see from the confusion matrix output, the overall accuracy is a little better. The sensitivity, or true positive rate, is 99 percent, which beats the decision tree model. The specificity, or true negative rate, is a bit lower at 87 percent. That difference in specificity is negligible relative to the gains in accuracy and sensitivity. Moreover, false positives are not detrimental to the business in this situation: providing a reward or pay bump to a good employee who was not planning to leave should not hurt the company. Before we move on, let's confirm that our AUC is maximized with a second ROC curve.
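A sketch of the second ROC curve, again using pROC under the same assumptions as the first:

boost_probs <- predict(glm_model, hr_model, type = "prob")[, "1"]
boost_roc <- roc(response = hr_model$left, predictor = boost_probs)
plot(boost_roc)
auc(boost_roc)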

The ROC curve again shows that the area under the curve appears to be close to its maximum. The logistic regression wins out here by a very small margin, providing advantages in accuracy and sensitivity over the decision tree model with only a minor disadvantage in specificity. As stated previously, false positives are not a huge concern in this scenario, since providing a bonus or incentive to an employee who was not planning to leave will not negatively impact the company. Both models appear to be fairly robust, with kappas of roughly 0.86 and 0.89. Because the logistic regression has a slight edge, we will use it for our final predictions.

2 Final Model

Now that we have decided on the best model, we need to divide the data into training and test sets to ensure we don't overfit.

set.seed(7)
# 75/25 train/test split, stratified on the outcome
trainh <- createDataPartition(hr_model$left, p = 0.75, list = FALSE)
hr_train <- hr_model[trainh, ]
hr_test <- hr_model[-trainh, ]

Now we can fit the model and make the predictions for the logistic regression model.

# Fit a standard logistic regression on the training set and score the test set
logreg <- glm(left ~ ., family = binomial(logit), data = hr_train)
prob_to_leave <- predict(logreg, newdata = hr_test, type = "response")
prediction_final <- data.frame(prob_to_leave)
prediction_final$performance <- hr_test$last_evaluation

In the code chunk above, I fitted the model and prepared a new data frame for visualization. Below is a plot showing the relationship between an employee's last evaluation and the likelihood that they will leave the company.

plot(prediction_final$prob_to_leave, prediction_final$performance,
     xlab = "Probability of Leaving", ylab = "Last Evaluation")

The table below provides the employees my model indicates will leave the company in the next year, ranked by priority.

# Priority score: probability of leaving weighted by performance
prediction_final$priority <- prediction_final$performance * prediction_final$prob_to_leave
orderprediction <- prediction_final[order(prediction_final$priority, decreasing = TRUE), ]
orderprediction <- head(orderprediction, n = 300)
datatable(orderprediction)   # interactive table from the DT package




2.1 Appendix

The following supplemental plots are associated with my exploratory analysis.