Many, if not most, companies struggle to understand why people leave. It is a vital question, because some of those who leave are good employees today with great prospects and are worth trying to hold onto. Others might be there only for the paycheck, doing the bare minimum and tying up resources that could be better used elsewhere in the company.
In this small project I will showcase data manipulation and data processing in R to create a flight risk model using logistic regression. Throughout, I will do my best to explain what I am doing, why, and how it will affect the analysis in upcoming steps.
Let's start by loading the packages needed and getting the data. Packages extend R to make analysis, graphing and other statistical methods easier to write. They implement some of the most advanced statistical methods in the world, created by professionals and distributed freely. The stock packages I always include are MASS, tidyverse, readr and fpp2. These are just my most used packages and may or may not be used here.
I took the liberty of printing the structure of my data (using the str() function). As we can see, I have 10 variables and 14999 observations. Most of the variable names are a bit cumbersome, though, so I want to change some of them.
colnames(data) = c("satisfaction", "evaluation", "projects", "hours_month", "exp",
                   "accident", "left", "promotion_5y", "department", "salary")
These names are easier to write and remember. Since we are interested in finding out who is leaving, let's start with a simple table of how many have left.
table(data$left)
##
## 0 1
## 11428 3571
Well, this is very unhelpful; 0s and 1s on their own don't tell me anything. Unfortunately, I don't have documentation on what the 0 and 1 represent. Let's assume for this case that 0s are currently employed with the company and 1s have left. So I create a categorical variable to better represent this.
data$left_factor = factor(data$left, levels = c(0,1), labels = c("stayed", "left"))
Now let's use it to see how many people have stayed with the company and how many have left in each department.
table(data$left_factor, data$department)
##
## accounting hr IT management marketing product_mng RandD sales
## stayed 563 524 954 539 655 704 666 3126
## left 204 215 273 91 203 198 121 1014
##
## support technical
## stayed 1674 2023
## left 555 697
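Raw counts are hard to compare across departments of very different sizes, so it can help to turn them into leaving rates. The sketch below simply re-enters the counts printed above into a matrix (the names counts, rates and leave_rate are my own, not from the original analysis) and uses prop.table() to get the share that left within each department. For support this gives 555 / 2229, roughly 0.249.

```r
# Counts copied from the table above (filled column by column: stayed, left).
counts <- matrix(
  c(563, 204, 524, 215, 954, 273, 539, 91, 655, 203,
    704, 198, 666, 121, 3126, 1014, 1674, 555, 2023, 697),
  nrow = 2,
  dimnames = list(
    status = c("stayed", "left"),
    department = c("accounting", "hr", "IT", "management", "marketing",
                   "product_mng", "RandD", "sales", "support", "technical")
  )
)

# Column-wise proportions: share that stayed and left within each department.
rates <- prop.table(counts, margin = 2)

# Leaving rate per department, highest first.
leave_rate <- sort(rates["left", ], decreasing = TRUE)
print(round(leave_rate, 3))

# Overall leaving rate across the whole company.
print(sum(counts["left", ]) / sum(counts))
```

With the real data frame, prop.table(table(data$left_factor, data$department), margin = 2) should give the same proportions directly.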
Since this data set is rather large, with many different departments, I want to focus on only one. I will arbitrarily choose support, because it has a rather unusual 555 people who have left (I like symmetry).
Now in R I have several options for how to proceed. I could filter on the department every time I need the support data (for example data[data$department == "support", ]), or I could subset my data frame once and create another one called data.sup. The first involves more typing and the chance of mistakes is too high, so I will go with creating a subset data frame. If I were only making a few descriptive statistics and graphs I would not create a new data frame, but since the subsequent analysis will be done only with the support department's data, it's easier for me to code and for others to understand what I am doing.
I subset using a logical condition within R's subset() function.
data.sup = subset(data, department == "support")
summary(data.sup)
## satisfaction evaluation projects hours_month
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:155.0
## Median :0.6500 Median :0.7400 Median :4.000 Median :200.0
## Mean :0.6183 Mean :0.7231 Mean :3.804 Mean :200.8
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:246.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## exp accident left promotion_5y
## Min. : 2.000 Min. :0.0000 Min. :0.000 Min. :0.000000
## 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.000000
## Median : 3.000 Median :0.0000 Median :0.000 Median :0.000000
## Mean : 3.393 Mean :0.1548 Mean :0.249 Mean :0.008973
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.000 3rd Qu.:0.000000
## Max. :10.000 Max. :1.0000 Max. :1.000 Max. :1.000000
##
## department salary left_factor
## support :2229 high : 141 stayed:1674
## accounting: 0 low :1146 left : 555
## hr : 0 medium: 942
## IT : 0
## management: 0
## marketing : 0
## (Other) : 0
That was easy enough. If you look at the department variable you can see this was successful, since there are 0 observations in all other departments. Now let's move on to look into who left.
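To make the two subsetting options concrete, here is a minimal sketch on a tiny invented data frame (toy, sup_a and sup_b are made-up names and values, not the real HR data). It shows that subset() and bracket indexing pick the same rows, and that the factor keeps its empty levels unless droplevels() is called, which is why the summary above still lists the other departments with 0.

```r
# Toy data frame, invented for illustration only (not the real HR data).
toy <- data.frame(
  satisfaction = c(0.38, 0.80, 0.11, 0.72, 0.37, 0.41),
  department   = factor(c("sales", "support", "support", "hr", "support", "sales")),
  left         = c(1, 0, 1, 0, 1, 0)
)

# Two ways to keep only the support rows.
sup_a <- subset(toy, department == "support")
sup_b <- toy[toy$department == "support", ]
same_rows <- isTRUE(all.equal(sup_a, sup_b))
print(same_rows)

# The factor still remembers the unused departments ...
print(table(sup_a$department))

# ... unless the empty levels are dropped explicitly.
sup_a$department <- droplevels(sup_a$department)
print(table(sup_a$department))
```

Dropping the unused levels is optional here, but it keeps later tables and plot legends free of empty categories.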
Because of the small scope of this project I want to focus on satisfaction and see if there are any obvious differences between high and low satisfaction. We would expect low satisfaction to increase the chances of people leaving, but let's have the data determine that.
Let's plot this and see what we get.
ggplot(data = data.sup, mapping = aes(x = satisfaction))+
geom_density(fill = "green")
Well, satisfaction seems to be high within the department overall. Now let's split this by who has left and who stayed.
ggplot(data = data.sup, mapping = aes(x = satisfaction))+
geom_density(aes(fill = left_factor))+
facet_wrap(~left_factor)
Well, this graph seems to indicate that satisfaction plays an important role. There are obvious groupings within the left group that need to be looked into more closely, but that is not the subject of this analysis.
Now that we know satisfaction appears to have an effect, the questions are: how large is it, is it statistically significant, and how large is it when controlling for other variables?
Understanding logistic regression is not particularly hard. It helps us predict the outcome of a binary variable (0s and 1s). It's very similar to linear regression; the main difference is that the model works on the log odds of the outcome, so the regression coefficients it reports are on a logarithmic (log odds) scale. You will understand this when we get more into the results, but for now we set our left variable as the one we want to predict (y) and use the other variables (x) to predict the outcome for each individual.
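Written out for the five predictors used in this post, the model has the standard logistic regression form below, where p is the probability that left equals 1 and the betas are the coefficients R estimates. The symbols are my own notation, not something the original analysis defines.

```latex
\log\frac{p}{1-p}
  = \beta_0
  + \beta_1\,\mathrm{satisfaction}
  + \beta_2\,\mathrm{evaluation}
  + \beta_3\,\mathrm{projects}
  + \beta_4\,\mathrm{hours\_month}
  + \beta_5\,\mathrm{exp}
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-\left(\beta_0 + \beta_1\,\mathrm{satisfaction} + \dots + \beta_5\,\mathrm{exp}\right)}}
```

The left-hand form is why the coefficients are read as changes in log odds; the right-hand form is what turns a person's predictor values into a probability of leaving.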
Creating the model is easy because we have already done all the data pre-processing and exploration, so R needs only a couple of lines.
my.log.model = glm(left~satisfaction+evaluation+projects+hours_month+
exp, data = data.sup, family = "binomial")
summary(my.log.model)
##
## Call:
## glm(formula = left ~ satisfaction + evaluation + projects + hours_month +
## exp, family = "binomial", data = data.sup)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1730 -0.6708 -0.4460 -0.1958 2.4957
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.283175 0.296569 -0.955 0.33966
## satisfaction -4.204771 0.248523 -16.919 < 2e-16 ***
## evaluation 0.899603 0.378761 2.375 0.01754 *
## projects -0.331960 0.055315 -6.001 1.96e-09 ***
## hours_month 0.003847 0.001300 2.958 0.00309 **
## exp 0.392798 0.041703 9.419 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2501.9 on 2228 degrees of freedom
## Residual deviance: 2034.2 on 2223 degrees of freedom
## AIC: 2046.2
##
## Number of Fisher Scoring iterations: 5
This model tells us a lot of useful information: every predictor is statistically significant at the 5% level, and the residual deviance (2034.2) is well below the null deviance (2501.9). Holding every other variable constant, each one-unit increase in satisfaction changes the log odds of leaving by -4.20. Well, that's not very informative, even if you could calculate the exponent in your head real quickly. This is not a problem. I will just create a function that converts the log odds coefficients to a probability scale. Since this is my own function I will not include the code itself but rather show you what comes out of it.
logit.2.prob(coef(my.log.model))
## (Intercept) satisfaction evaluation projects hours_month
## 0.42967551 0.01470474 0.71086789 0.41776370 0.50096176
## exp
## 0.59695597
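The numbers printed above are consistent with the standard inverse-logit transform, exp(x) / (1 + exp(x)). The sketch below is a guess at what such a helper could look like, not the author's actual logit.2.prob(); the name logit_to_prob and the coefs vector (typed in from the model summary) are my own. It also exponentiates the coefficients to get odds ratios, which is the more common way to read them: because satisfaction runs from 0 to 1, a 0.1 increase multiplies the odds of leaving by about 0.66.

```r
# Hypothetical reconstruction: the original logit.2.prob() is not shown.
# This is the standard inverse-logit transform.
logit_to_prob <- function(log_odds) {
  exp(log_odds) / (1 + exp(log_odds))
}

# Coefficient estimates copied from the model summary above.
coefs <- c(`(Intercept)` = -0.283175, satisfaction = -4.204771,
           evaluation = 0.899603, projects = -0.331960,
           hours_month = 0.003847, exp = 0.392798)

probs <- logit_to_prob(coefs)
print(round(probs, 4))

# Base R ships the same transform as plogis().
print(all.equal(probs, plogis(coefs)))

# Odds ratios: multiplicative change in the odds of leaving for a one-unit
# increase in each predictor, holding the others constant.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 4))

# Satisfaction runs from 0 to 1, so a 0.1 step is a more natural unit.
or_satisfaction_tenth <- exp(0.1 * coefs[["satisfaction"]])
print(or_satisfaction_tenth)
```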
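So what does this tell us? The intercept converts directly to a probability: about a 0.43 chance of leaving when every predictor is zero. For the other coefficients the value is best read as a direction: below 0.5 means the variable lowers the odds of leaving and above 0.5 means it raises them. So higher satisfaction goes with a much lower chance of leaving, higher evaluation with a higher chance of leaving, and so on for each variable. This means that, using R, we could identify individuals with a large probability of leaving and maybe do something about it, or at least relay this to HR for them to take some action.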
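Identifying individuals at risk comes down to asking the fitted model for one predicted probability per person. The sketch below shows the idea on simulated data, since the real data set is not reproduced here; the data frame sim, the assumed relationship used to generate it, and the 0.5 cut-off are all invented for illustration. With the real model, predict(my.log.model, type = "response") would play the same role.

```r
# Simulated data, invented for illustration only (not the real HR data).
set.seed(42)
n <- 500
sim <- data.frame(
  satisfaction = runif(n, 0.1, 1),
  exp          = sample(2:10, n, replace = TRUE)
)
# Assumed relationship, loosely inspired by the signs in the model summary.
sim$left <- rbinom(n, size = 1,
                   prob = plogis(-0.5 - 4 * sim$satisfaction + 0.4 * sim$exp))

# Fit a logistic regression and get one predicted probability per person.
fit <- glm(left ~ satisfaction + exp, data = sim, family = "binomial")
sim$prob_leaving <- predict(fit, type = "response")

# Flag people above a hypothetical 0.5 threshold, highest risk first.
at_risk <- sim[sim$prob_leaving > 0.5, ]
at_risk <- at_risk[order(at_risk$prob_leaving, decreasing = TRUE), ]
print(head(at_risk))
```

The threshold is a business decision rather than a statistical one: a lower cut-off flags more people at the cost of more false alarms.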
Now let's look at the confidence interval of our prediction using a package called plotly. This package takes plots made in ggplot and makes them interactive so we can play with them a bit more.
ggplotly(q)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
Now I have all the probability scores stored away in a variable. I want to plot them to understand the distribution better. Before I do this I want to turn them into a factor, cutting my continuous variable into chunks for better interpretation.
prob_leaving = cut(b, breaks = c(0, 0.25, 0.5, 0.75,1), labels = c("Low", "Fair", "High", "Very High"))
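It is worth knowing how cut() treats values that fall exactly on a break. A small sketch with made-up scores (b_example and risk are my own names, not the real predictions): intervals are closed on the right by default, so 0.25 counts as "Low", and a score of exactly 0 would become NA unless include.lowest = TRUE is set.

```r
# Made-up probability scores, for illustration only.
b_example <- c(0.10, 0.25, 0.30, 0.60, 0.90)

risk <- cut(b_example,
            breaks = c(0, 0.25, 0.5, 0.75, 1),
            labels = c("Low", "Fair", "High", "Very High"),
            include.lowest = TRUE)  # also keeps a score of exactly 0 in "Low"

# Intervals are closed on the right, so 0.25 lands in "Low", not "Fair".
print(data.frame(score = b_example, risk = risk))
print(table(risk))
```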
Now let's see how it looks when we plot satisfaction (x) against evaluation (y), grouping the plot by probability of leaving and coloring by salary.
If we wanted, we could also use plotly to make this graph interactive, showcasing more functionality and improving the readability of the graph.
ggplotly(plot_prob)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
I will be adding more to this analysis, but I hope you have enjoyed it so far.