Why do people leave, and should we care?

Many companies struggle to understand why people leave. It is a vital question, because some of those who leave are good employees today, or have great future prospects, and are worth trying to hold on to. Others might be there only for the paycheck, doing the bare minimum and tying up resources that could be better used elsewhere in the company.

In this small project I will showcase data manipulation and data processing in R to create a flight risk model using logistic regression. Throughout, I will do my best to explain what I am doing, why, and how it will affect the analysis in upcoming steps.

Pre-processing of the data: getting things in order

Let's start by getting the data and loading the packages needed. Packages are created for R to make analysis, graphing and other statistical methods easier to write. They implement some of the most advanced statistical methods in the world, created by professionals and distributed freely by them. The stock packages I always include are MASS, tidyverse, readr and fpp2. These are just my most used packages and may or may not all be used here.

I took the liberty of printing the structure of my data (using the str() function). As we can see, I have 10 variables and 14999 observations. Most of the variable names are a bit cumbersome, though, so I want to change some of them.

colnames(data) = c("satisfaction", "evaluation", "projects", "hours_month", "exp", "accident"
                   , "left", "promotion_5y", "department", "salary")

These names are easier to write and remember. Since we are interested in finding out who is leaving, let's start with a simple table of how many have left.

table(data$left)
## 
##     0     1 
## 11428  3571

Well, this is very unhelpful; 0s and 1s don't tell me anything. Unfortunately I don't have documentation of what the 0 and 1 represent. But let's assume for this case that 0s are currently employed with the company and 1s have left. So I create a categorical variable to better represent this.

data$left_factor = factor(data$left, levels = c(0,1), labels = c("stayed", "left"))
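Raw counts are easier to judge as shares. As a small sketch, the counts printed above can be turned into proportions with prop.table(). The vector below is rebuilt from those counts rather than read from the data, so that the snippet runs on its own, and it keeps the assumption that 0 means stayed and 1 means left.

```r
# Rebuild the 0/1 vector from the counts in the table above
# (11428 zeros, 3571 ones); with the real data, use data$left directly.
left <- rep(c(0, 1), times = c(11428, 3571))
left_factor <- factor(left, levels = c(0, 1), labels = c("stayed", "left"))

# Share of employees in each group
shares <- prop.table(table(left_factor))
round(shares, 3)
```

Under that assumption, a little under a quarter of all employees in the data have left.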

Now let's use it to see how many people are still with the company and how many have left in each department.

table(data$left_factor, data$department)
##         
##          accounting   hr   IT management marketing product_mng RandD sales
##   stayed        563  524  954        539       655         704   666  3126
##   left          204  215  273         91       203         198   121  1014
##         
##          support technical
##   stayed    1674      2023
##   left       555       697
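Because the departments differ a lot in size, the counts are hard to compare directly. A minimal sketch of one way to get the share of leavers per department is prop.table() with margin = 2. The matrix here is typed in from the table above so the snippet is self-contained; with the real data the input would be table(data$left_factor, data$department).

```r
# Counts copied from the table above (rows: stayed, left)
counts <- rbind(
  stayed = c(563, 524, 954, 539, 655, 704, 666, 3126, 1674, 2023),
  left   = c(204, 215, 273,  91, 203, 198, 121, 1014,  555,  697)
)
colnames(counts) <- c("accounting", "hr", "IT", "management", "marketing",
                      "product_mng", "RandD", "sales", "support", "technical")

# Column-wise proportions: share of each department that stayed or left
rates <- prop.table(counts, margin = 2)

# Departments sorted by the share that left, highest first
round(sort(rates["left", ], decreasing = TRUE), 3)
```

Read this way, hr has the highest share of leavers (about 29%) and management the lowest (about 14%), while support sits at roughly 25%.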

Since this data set is rather large, with many different departments, I want to focus on only one. More or less at random I will choose support, because a rather unusual 555 people have left it (I like symmetry).

Looking at the support department

Now, in R I have many different options on how to proceed. I could filter on the department every time I need it (for example data[data$department == "support", ]), or I could subset my data frame once and create another one called data.sup. There is more typing involved in the first, and the chance of mistakes is too high, so I will go with creating a subset data frame. If I were only making a few descriptive statistics, graphs and such, I would not create a new data frame, but since the subsequent analysis will be done only with the support department's data, it's easier for me to code and for others to understand what I am doing.

I subset using a logical condition inside R's subset() function.

data.sup = subset(data, department == "support")

summary(data.sup)
##   satisfaction      evaluation        projects      hours_month   
##  Min.   :0.0900   Min.   :0.3600   Min.   :2.000   Min.   : 96.0  
##  1st Qu.:0.4400   1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:155.0  
##  Median :0.6500   Median :0.7400   Median :4.000   Median :200.0  
##  Mean   :0.6183   Mean   :0.7231   Mean   :3.804   Mean   :200.8  
##  3rd Qu.:0.8200   3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:246.0  
##  Max.   :1.0000   Max.   :1.0000   Max.   :7.000   Max.   :310.0  
##                                                                   
##       exp            accident           left        promotion_5y     
##  Min.   : 2.000   Min.   :0.0000   Min.   :0.000   Min.   :0.000000  
##  1st Qu.: 3.000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.000000  
##  Median : 3.000   Median :0.0000   Median :0.000   Median :0.000000  
##  Mean   : 3.393   Mean   :0.1548   Mean   :0.249   Mean   :0.008973  
##  3rd Qu.: 4.000   3rd Qu.:0.0000   3rd Qu.:0.000   3rd Qu.:0.000000  
##  Max.   :10.000   Max.   :1.0000   Max.   :1.000   Max.   :1.000000  
##                                                                      
##       department      salary     left_factor  
##  support   :2229   high  : 141   stayed:1674  
##  accounting:   0   low   :1146   left  : 555  
##  hr        :   0   medium: 942                
##  IT        :   0                              
##  management:   0                              
##  marketing :   0                              
##  (Other)   :   0

That was easy enough. If you look at the department variable you can see this was successful, since there are 0 observations in all other departments. Now let's move on to look into who left.
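The zero counts show up because a factor keeps all of its levels after subsetting. As a sketch on a tiny made-up data frame (the values are purely illustrative, not taken from the real data), this is how subset() compares with logical indexing, and how droplevels() clears the unused levels:

```r
# Toy stand-in for the real data frame (values are made up for illustration)
toy <- data.frame(
  department   = factor(c("support", "sales", "support", "hr")),
  satisfaction = c(0.4, 0.8, 0.9, 0.5)
)

# Two routes to the same rows: subset() and logical indexing
sup_a <- subset(toy, department == "support")
sup_b <- toy[toy$department == "support", ]
all.equal(sup_a, sup_b)

# The unused levels ("hr", "sales") survive the subset ...
levels(sup_a$department)

# ... until droplevels() removes them
sup_a <- droplevels(sup_a)
levels(sup_a$department)
```

Dropping the levels is optional here, but it keeps later tables and plots from showing empty departments.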

Who left, and why?

Because of the small scope of this project I want to focus on satisfaction and see if there are any obvious differences between high and low satisfaction. We would expect low satisfaction to increase the chances of people leaving, but let's have the data determine that.

Let's plot this and see what we get.

ggplot(data = data.sup, mapping = aes(x = satisfaction))+
  geom_density(fill = "green")

Well, satisfaction seems to be fairly high within the department overall (the median is 0.65, with a maximum of 1). Now let's split this by who has left and who stayed.

ggplot(data = data.sup, mapping = aes(x = satisfaction))+
  geom_density(aes(fill = left_factor))+
  facet_wrap(~left_factor)

Well, this graph seems to indicate that satisfaction plays an important role. There are obvious groupings within the left group that need to be looked into more closely, but they are not the subject of this analysis.
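A density plot can be backed up with a per-group summary. The sketch below uses tapply() on a handful of made-up values (they are not from the real data set) just to show the shape of the call; with the real data the inputs would be data.sup$satisfaction and data.sup$left_factor.

```r
# Made-up values for illustration only
toy <- data.frame(
  satisfaction = c(0.7, 0.8, 0.6, 0.9, 0.1, 0.4, 0.8, 0.4),
  left_factor  = factor(rep(c("stayed", "left"), each = 4),
                        levels = c("stayed", "left"))
)

# Median satisfaction within each group
med <- tapply(toy$satisfaction, toy$left_factor, median)
med
```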

Now that the plot suggests satisfaction is having an effect, the questions are: how large is the effect, is it statistically significant, and how large is it when controlling for other variables?

The modeling process: Cameras set, lights set, logistic regression GO.

Understanding logistic regression is not particularly hard. It helps us predict the outcome of a binary variable (0s and 1s). It is very similar to linear regression; the main difference is that the model is linear in the log odds of the outcome, so the regression coefficients are reported on the log-odds scale. You will understand this when we get more into the results, but for now we set our left variable (the 0/1 version of left_factor) as the one we want to predict (y) and use the other variables (x) to predict the outcome for each individual.
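In symbols, and using the variable names from this post, a logistic regression on these five predictors can be written as follows, where p is the probability that an employee has left and the betas are the coefficients to be estimated:

```latex
\begin{align*}
\log\frac{p}{1-p} &= \beta_0 + \beta_1\,\text{satisfaction} + \beta_2\,\text{evaluation}
  + \beta_3\,\text{projects} + \beta_4\,\text{hours\_month} + \beta_5\,\text{exp} \\
p &= \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{satisfaction} + \dots + \beta_5\,\text{exp})}}
\end{align*}
```

The first line is the log-odds form that the coefficients refer to; the second is the same model solved for the probability.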

Creating the model is easy enough because we have already done all the data pre-processing and exploration, so it takes a single call to glm().

my.log.model = glm(left~satisfaction+evaluation+projects+hours_month+
                   exp, data = data.sup, family = "binomial")
summary(my.log.model)
## 
## Call:
## glm(formula = left ~ satisfaction + evaluation + projects + hours_month + 
##     exp, family = "binomial", data = data.sup)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1730  -0.6708  -0.4460  -0.1958   2.4957  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.283175   0.296569  -0.955  0.33966    
## satisfaction -4.204771   0.248523 -16.919  < 2e-16 ***
## evaluation    0.899603   0.378761   2.375  0.01754 *  
## projects     -0.331960   0.055315  -6.001 1.96e-09 ***
## hours_month   0.003847   0.001300   2.958  0.00309 ** 
## exp           0.392798   0.041703   9.419  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2501.9  on 2228  degrees of freedom
## Residual deviance: 2034.2  on 2223  degrees of freedom
## AIC: 2046.2
## 
## Number of Fisher Scoring iterations: 5
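The Estimate column is on the log-odds scale. One common way to make it easier to read is to exponentiate it into odds ratios; the sketch below does that with the coefficients typed in from the output above, so it runs without the model object (with the model at hand, exp(coef(my.log.model)) would give the same thing).

```r
# Coefficients copied from the summary output above
log_odds <- c("(Intercept)" = -0.283175, satisfaction = -4.204771,
              evaluation = 0.899603, projects = -0.331960,
              hours_month = 0.003847, exp = 0.392798)

# Odds ratio for a one-unit increase in each predictor
odds_ratio <- exp(log_odds)
round(odds_ratio, 3)

# Satisfaction only runs from 0.09 to 1, so a 0.1 step is easier to read:
# the odds of leaving are multiplied by this factor per 0.1 increase
or_sat_tenth <- exp(0.1 * log_odds[["satisfaction"]])
round(or_sat_tenth, 2)
```

For satisfaction, a 0.1 increase multiplies the odds of leaving by about 0.66, holding the other predictors fixed.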

Now this model is significant and tells us a lot of useful information. When controlling for every other variable in the model, each one-unit increase in satisfaction changes the log odds of leaving by -4.20. Well, that's very uninformative, even if you could calculate the exponent in your head real quickly. This is not a problem, though. I will just create a function that converts the log-odds coefficients to the probability scale. Since this is my function I will not include the code itself but rather show you what comes out of it.

logit.2.prob(coef(my.log.model))
##  (Intercept) satisfaction   evaluation     projects  hours_month 
##   0.42967551   0.01470474   0.71086789   0.41776370   0.50096176 
##          exp 
##   0.59695597
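The code of logit.2.prob() is not shown, but the printed values are consistent with the standard inverse-logit transformation. The following is a hypothetical stand-in (the name logit_to_prob is mine, not the author's), applied to the coefficients typed in from the model summary:

```r
# Hypothetical stand-in for logit.2.prob(): the inverse logit
logit_to_prob <- function(log_odds) {
  exp(log_odds) / (1 + exp(log_odds))
}

# Coefficients copied from the model summary
log_odds <- c("(Intercept)" = -0.283175, satisfaction = -4.204771,
              evaluation = 0.899603, projects = -0.331960,
              hours_month = 0.003847, exp = 0.392798)

probs <- logit_to_prob(log_odds)
round(probs, 4)

# Base R ships the same transformation as plogis()
all.equal(probs, plogis(log_odds))
```

A coefficient of 0 maps to 0.5, so values below 0.5 correspond to negative coefficients and values above 0.5 to positive ones.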

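So what does this tell us? Each coefficient has been converted to the probability scale, where 0.5 corresponds to no effect: values below 0.5 mean that a higher value of that variable lowers the chance of leaving, and values above 0.5 mean that it raises it. So high satisfaction goes with a very low probability of leaving, high evaluation with a higher probability of leaving, and so on for each variable. Note that these numbers are not themselves an individual's probability of leaving; that comes from the model's fitted values. This means, using R, we could identify individuals with a large probability of leaving and maybe do something about it, or at least relay this to HR for them to take some action.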
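To make the step from coefficients to an individual's probability concrete, here is a sketch for two made-up employees who differ only in satisfaction. The coefficients are typed in from the model summary; with the fitted model available, predict(my.log.model, newdata = employees, type = "response") would be the usual route.

```r
# Coefficients copied from the model summary
log_odds <- c("(Intercept)" = -0.283175, satisfaction = -4.204771,
              evaluation = 0.899603, projects = -0.331960,
              hours_month = 0.003847, exp = 0.392798)

# Two hypothetical employees, identical except for satisfaction
employees <- data.frame(
  satisfaction = c(0.4, 0.8),
  evaluation   = c(0.7, 0.7),
  projects     = c(4, 4),
  hours_month  = c(200, 200),
  exp          = c(3, 3)
)

# Linear predictor: intercept plus coefficient-weighted predictor values
eta <- log_odds[["(Intercept)"]] +
  drop(as.matrix(employees) %*% log_odds[names(employees)])

# Inverse logit turns the log odds into a probability of leaving
prob_leaving <- plogis(eta)
round(prob_leaving, 2)
```

With these made-up values, the employee with satisfaction 0.4 comes out at roughly 0.33 and the one with satisfaction 0.8 at roughly 0.08.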

Now let's look at the confidence interval of our prediction using a package called plotly. This package takes my plots made in ggplot and makes them interactive so we can play with them a bit more.

ggplotly(q)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

Now I have all the probability scores stored away in a variable. What I want to do is plot them to understand the distribution better. Before I do this I want to turn them into a factor, cutting my continuous variable into chunks for better interpretation.

prob_leaving = cut(b, breaks = c(0, 0.25, 0.5, 0.75,1), labels = c("Low", "Fair", "High", "Very High"))
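To see how cut() treats the boundaries, here is the same call on a few made-up probabilities (b_demo is a hypothetical stand-in for b, which presumably holds the model's fitted probabilities):

```r
# Made-up probabilities standing in for the fitted values
b_demo <- c(0.10, 0.25, 0.30, 0.60, 0.90)

prob_demo <- cut(b_demo, breaks = c(0, 0.25, 0.5, 0.75, 1),
                 labels = c("Low", "Fair", "High", "Very High"))

# Intervals are closed on the right, so 0.25 still counts as "Low"
data.frame(b_demo, prob_demo)
table(prob_demo)
```

One detail worth knowing: a value of exactly 0 would fall outside the first interval and become NA unless include.lowest = TRUE is added. Fitted probabilities from a logistic regression are strictly between 0 and 1 in theory, so this rarely matters here.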

Now let's see how it looks when we plot satisfaction (x) against evaluation (y), grouping the plot by probability of leaving and coloring by salary.

If we wanted, we could also use plotly to make this graph interactive, showcasing more functionality and improving the readability of the graph.

ggplotly(plot_prob)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

I will be adding more to this analysis but I hope you have enjoyed it so far.