Preview

We providing the equivalent R-code for the same exercise with some additional empahsis on a few extra steps like -

GGPLOT usage
Boruta Algorithm
Alternate ways of breaking down data in R are included along the way

The code follows the standard ML pipeline of Preprocessing, exploration, modelling, iterating and insights. Do feel free to raise your questions.. There is a lot of emphasis on the visualization aspect to understand the data.

Part 1: Obtaining the Data

#Read the analytics csv file and store our dataset into a dataframe called "df"
library(readr)

## Warning: package 'readr' was built under R version 3.5.2

data_set<-read.csv('D:/Back up May 2018/Documents/Rattle-data/HR_comma_sep1.csv', sep=";")

Part 2: Data Cleaning

Always better to run some basic checks and see if there are missing values or unusual MAX Values in every variable. The data aint gonna be friendly.

# Check to see if there are any missing values in our data and checking overall summary
summary(data_set)

##  satisfaction_level last_evaluation  number_project  average_montly_hours
##  Min.   :0.0900     Min.   :0.3600   Min.   :2.000   Min.   : 96.0       
##  1st Qu.:0.4400     1st Qu.:0.5600   1st Qu.:3.000   1st Qu.:156.0       
##  Median :0.6400     Median :0.7200   Median :4.000   Median :200.0       
##  Mean   :0.6128     Mean   :0.7161   Mean   :3.803   Mean   :201.1       
##  3rd Qu.:0.8200     3rd Qu.:0.8700   3rd Qu.:5.000   3rd Qu.:245.0       
##  Max.   :1.0000     Max.   :1.0000   Max.   :7.000   Max.   :310.0       
##                                                                          
##  time_spend_company Work_accident     left       promotion_last_5years
##  Min.   : 2.000     Min.   :0.0000   NO :11428   Min.   :0.00000      
##  1st Qu.: 3.000     1st Qu.:0.0000   YES: 3571   1st Qu.:0.00000      
##  Median : 3.000     Median :0.0000               Median :0.00000      
##  Mean   : 3.498     Mean   :0.1446               Mean   :0.02127      
##  3rd Qu.: 4.000     3rd Qu.:0.0000               3rd Qu.:0.00000      
##  Max.   :10.000     Max.   :1.0000               Max.   :1.00000      
##                                                                       
##          sales         salary    
##  sales      :4140   high  :1237  
##  technical  :2720   low   :7316  
##  support    :2229   medium:6446  
##  IT         :1227                
##  product_mng: 902                
##  marketing  : 858                
##  (Other)    :2923

Renaming the irrelevant Variable name

The only variable which doesnt represent what it stands is the “Sales” variable which clearly represents Job Role

#Data overview  Just so that the summary is not good enough to get an idea of the data into your head

#Renaming dataset 
library(plyr)
library(knitr)

## Warning: package 'knitr' was built under R version 3.5.2

data_set<-rename(data_set, c("sales"="role"))
data_set<-rename(data_set, c("time_spend_company"="exp_in_company"))
names(data_set)[10]<-"salary"
kable(head(data_set))

satisfaction_level	last_evaluation	number_project	average_montly_hours	exp_in_company	left	role	salary
0.38	0.53	2	157	3	YES	sales	low
0.80	0.86	5	262	6	YES	sales	medium
0.11	0.88	7	272	4	YES	sales	medium
0.72	0.87	5	223	5	YES	sales	low
0.37	0.52	2	159	3	YES	sales	low
0.41	0.50	2	153	3	YES	sales	low

Part 3: Exploring the Data

3a. Statistical Overview

The dataset has:

About 15,000 employee observations and 10 features
The company had a turnover rate of about 24%
Mean satisfaction of employees is 0.61

# The dataset contains 10 columns and 14999 observations
dim(data_set)

## [1] 14999    10

# Check the type of our features. 
data_set$left<-as.numeric(data_set$left)
data_set$left<-data_set$left-1
str(data_set)

## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ exp_in_company       : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ role                 : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ salary               : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...

# Looks like about 76% of employees stayed and 24% of employees left. 
# NOTE: When performing cross validation, its important to maintain this turnover ratio
attrition<-as.factor(data_set$left)
summary(attrition)

##     0     1 
## 11428  3571

perc_attrition_rate<-sum(data_set$left/length(data_set$left))*100
#percentage of attrition
print(perc_attrition_rate)

## [1] 23.80825

# Overview of summary (Turnover V.S. Non-turnover)
cor_vars<-data_set[,c("satisfaction_level","last_evaluation","number_project","average_montly_hours","exp_in_company","Work_accident","left","promotion_last_5years")]

aggregate(cor_vars[,c("satisfaction_level","last_evaluation","number_project","average_montly_hours","exp_in_company","Work_accident","promotion_last_5years")], by=list(Category=cor_vars$left), FUN=mean)

##   Category satisfaction_level last_evaluation number_project
## 1        0          0.6668096       0.7154734       3.786664
## 2        1          0.4400980       0.7181126       3.855503
##   average_montly_hours exp_in_company Work_accident promotion_last_5years
## 1             199.0602       3.380032    0.17500875           0.026251313
## 2             207.4192       3.876505    0.04732568           0.005320638

3b. Correlation Matrix & Heatmap

Moderate Positively Correlated Features:

projectCount vs evaluation: 0.349333
projectCount vs averageMonthlyHours: 0.417211
averageMonthlyHours vs evaluation: 0.339742

Moderate Negatively Correlated Feature:

satisfaction vs turnover: -0.388375

Stop and Think:

What features affect our target variable the most (turnover)?
What features have strong correlations with each other? Can we do a more in depth examination of these features?

Summary:

From the heatmap, there is a positive(+) correlation between projectCount, averageMonthlyHours, and evaluation. Which could mean that the employees who spent more hours and did more projects were evaluated highly.
For the negative(-) relationships, turnover and satisfaction are highly correlated. I’m assuming that people tend to leave a company more when they are less satisfied.

#Correlation Matrix
library(reshape2)
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.3

cor_vars<-data_set[,c("satisfaction_level","last_evaluation","number_project","average_montly_hours","exp_in_company","Work_accident","left","promotion_last_5years")]
cor(cor_vars)

##                       satisfaction_level last_evaluation number_project
## satisfaction_level            1.00000000     0.105021214   -0.142969586
## last_evaluation               0.10502121     1.000000000    0.349332589
## number_project               -0.14296959     0.349332589    1.000000000
## average_montly_hours         -0.02004811     0.339741800    0.417210634
## exp_in_company               -0.10086607     0.131590722    0.196785891
## Work_accident                 0.05869724    -0.007104289   -0.004740548
## left                         -0.38837498     0.006567120    0.023787185
## promotion_last_5years         0.02560519    -0.008683768   -0.006063958
##                       average_montly_hours exp_in_company Work_accident
## satisfaction_level            -0.020048113   -0.100866073   0.058697241
## last_evaluation                0.339741800    0.131590722  -0.007104289
## number_project                 0.417210634    0.196785891  -0.004740548
## average_montly_hours           1.000000000    0.127754910  -0.010142888
## exp_in_company                 0.127754910    1.000000000   0.002120418
## Work_accident                 -0.010142888    0.002120418   1.000000000
## left                           0.071287179    0.144822175  -0.154621634
## promotion_last_5years         -0.003544414    0.067432925   0.039245435
##                              left promotion_last_5years
## satisfaction_level    -0.38837498           0.025605186
## last_evaluation        0.00656712          -0.008683768
## number_project         0.02378719          -0.006063958
## average_montly_hours   0.07128718          -0.003544414
## exp_in_company         0.14482217           0.067432925
## Work_accident         -0.15462163           0.039245435
## left                   1.00000000          -0.061788107
## promotion_last_5years -0.06178811           1.000000000

trans<-cor(cor_vars)
melted_cormat <- melt(trans)

ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() +theme(axis.text.x = element_text(angle = 90, hjust = 1))

3b2. Statistical Test for Correlation

One-Sample T-Test (Measuring Satisfaction Level) A one-sample t-test checks whether a sample mean differs from the population mean. Let’s test to see whether the average satisfaction level of employees that had a turnover differs from the entire employee population.

Hypothesis Testing: Is there significant difference in the means of satisfaction level between employees who had a turnover and the entire employee population?

Null Hypothesis: (H0: pTS = pES) The null hypothesis would be that there is no difference in satisfaction level between employees who did turnover and the entire employee population.
Alternate Hypothesis: (HA: pTS != pES) The alternative hypothesis would be that there is a difference in satisfaction level between employees who did turnover and the entire employee population.

# Let's compare the means of our employee turnover satisfaction against the employee population satisfaction
emp_population_satisfaction <-mean(data_set$satisfaction_level)
left_pop<-subset(data_set,left==1)

emp_turnover_satisfaction <-mean(left_pop$satisfaction_level)

print( c('The mean for the employee population is: ', emp_population_satisfaction) )

## [1] "The mean for the employee population is: "
## [2] "0.612833522234816"

print( c('The mean for the employees that had a turnover is: ' ,emp_turnover_satisfaction) )

## [1] "The mean for the employees that had a turnover is: "
## [2] "0.440098011761411"

Conducting the T-Test

Let’s conduct a t-test at 95% confidence level and see if it correctly rejects the null hypothesis that the sample comes from the same distribution as the employee population. To conduct a one sample t-test, we can use the stats.ttest_1samp() function:

t.test(left_pop$satisfaction_level,mu=emp_population_satisfaction) # Employee Population satisfaction mean

## 
##  One Sample t-test
## 
## data:  left_pop$satisfaction_level
## t = -39.109, df = 3570, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0.6128335
## 95 percent confidence interval:
##  0.4314385 0.4487576
## sample estimates:
## mean of x 
##  0.440098

T-Test Result

The test result shows the test statistic “t” is equal to -39.109. This test statistic tells us how much the sample mean deviates from the null hypothesis. If the t-statistic lies outside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we reject the null hypothesis. We can check the quantiles with stats.t.ppf():

T-Test Quantile

If the t-statistic value we calculated above (-39.109) is outside the quantiles, then we can reject the null hypothesis

#degress of freedom
dof<-sum(as.numeric(data_set$left))
LQ <-qt(0.025,dof)  # Left Quartile
RQ <-qt(0.975,dof)  # Right Quartile
print (c('The t-distribution left quartile range is: ',LQ))

## [1] "The t-distribution left quartile range is: "
## [2] "-1.96062852159556"

print (c('The t-distribution right quartile range is: ' ,RQ))

## [1] "The t-distribution right quartile range is: "
## [2] "1.96062852159556"

One-Sample T-Test Summary

T-Test = -39.109 | P-Value = 9.01e-279 | Reject Null Hypothesis Reject the null hypothesis because:

T-Test score is outside the quantiles
P-value is lower than confidence level of 5%

Based on the statistical analysis of a one sample t-test, there seems to be some significant difference between the mean satisfaction of employees who had a turnover and the entire employee population. The super low P-value of 9.012e-279 at a 5% confidence level is a good indicator to reject the null hypothesis.

But this does not neccessarily mean that there is practical significance. We would have to conduct more experiments or maybe collect more data about the employees in order to come up with a more accurate finding.

3c. Distribution Plots (Satisfaction - Evaluation - AverageMonthlyHours)

Summary: Let’s examine the distribution on some of the employee’s features. Here’s what I found:

Satisfaction - There is a huge spike for employees with low satisfaction and high satisfaction.
Evaluation - There is a bimodal distrubtion of employees for low evaluations (less than 0.6) and high evaluations (more than 0.8)
AverageMonthlyHours - There is another bimodal distribution of employees with lower and higher average monthly hours (less than 150 hours & more than 250 hours)
The evaluation and average monthly hour graphs both share a similar distribution.
Employees with lower average monthly hours were evaluated less and vice versa.
If you look back at the correlation matrix, the high correlation between evaluation and averageMonthlyHours does support this finding.

Stop and Think:

Is there a reason for the high spike in low satisfaction of employees? Could employees be grouped in a way with these features? Is there a correlation between evaluation and averageMonthlyHours?

par(mfrow=c(1,3))
hist(data_set$satisfaction_level, col="green")
hist(data_set$last_evaluation, col="red")
hist(data_set$average_montly_hours, col="blue")

3d. Salary V.S. Turnover

Summary: This is not unusual. Here’s what I found:

Majority of employees who left either had low or medium salary.
Barely any employees left with high salary
Employees with low to average salaries tend to leave the company.

Stop and Think:

What is the work environment like for low, medium, and high salaries?
What made employees with high salaries to leave?

vis_1<-table(data_set$salary,data_set$left)
#print(vis_1)
d_vis_1<-as.data.frame(vis_1)
print(d_vis_1)

##     Var1 Var2 Freq
## 1   high    0 1155
## 2    low    0 5144
## 3 medium    0 5129
## 4   high    1   82
## 5    low    1 2172
## 6 medium    1 1317

library(ggplot2)
p<-ggplot(d_vis_1, aes(x=Var1,y=Freq,fill=Var2)) +
 geom_bar(position="dodge",stat='identity') + coord_flip()

print(p)

3e. Department V.S. Turnover

Summary: Let’s see more information about the departments. Here’s what I found:

The sales, technical, and support department were the top 3 departments to have employee turnover
The management department had the smallest amount of turnover

Stop and Think:

If we had more information on each department, can we pinpoint a more direct cause for employee turnover?

vis_2<-table(data_set$role,data_set$left)
d_vis_2<-as.data.frame(vis_2)
d_vis_2<-subset(d_vis_2,Var2==1)
#print(d_vis_2)
library(ggplot2)
d_vis_2$Var1 <- factor(d_vis_2$Var1, levels = d_vis_2$Var1[order(-d_vis_2$Freq)])
p<-ggplot(d_vis_2, aes(x=Var1,y=Freq,fill=Var1)) +
 geom_bar(stat='identity') +theme(axis.text.x = element_text(angle = 90, hjust = 1))

print(p)

3f. Turnover V.S. ProjectCount

Summary: This graph is quite interesting as well. Here’s what I found:

More than half of the employees with 2,6, and 7 projects left the company
Majority of the employees who did not leave the company had 3,4, and 5 projects
All of the employees with 7 projects left the company

There is an increase in employee turnover rate as project count increases

Stop and Think:

Why are employees leaving at the lower/higher spectrum of project counts?
Does this means that employees with project counts 2 or less are not worked hard enough or are not highly valued, thus leaving the company?
Do employees with 6+ projects are getting overworked, thus leaving the company?

vis_3<-table(data_set$number_project,data_set$left)
d_vis_3<-as.data.frame(vis_3)
#print(d_vis_1)
library(ggplot2)
p<-ggplot(d_vis_3, aes(x=Var1,y=Freq,fill=Var2)) +
 geom_bar(position="dodge",stat='identity') + coord_flip()

print(p)

3g. Turnover V.S. Evaluation

Summary:

There is a bimodal distribution for those that had a turnover.
Employees with low performance tend to leave the company more
Employees with high performance tend to leave the company more
The sweet spot for employees that stayed is within 0.6-0.8 evaluation

# Kernel Density Plot
left_data<-subset(data_set,left==1)
stay_data<-subset(data_set,left==0)
ggplot() + geom_density(aes(x=last_evaluation), colour="red", data=left_data) + 
  geom_density(aes(x=last_evaluation), colour="blue", data=stay_data)

3h. Turnover V.S. AverageMonthlyHours

Summary:

Another bi-modal distribution for employees that turnovered
Employees who had less hours of work (~150hours or less) left the company more
Employees who had too many hours of work (~250 or more) left the company
Employees who left generally were underworked or overworked.

#KDEPlot: Kernel Density Estimate Plot

ggplot() + geom_density(aes(x=average_montly_hours), colour="red", data=left_data) + 
  geom_density(aes(x=average_montly_hours), colour="blue", data=stay_data)

3i. Turnover V.S. Satisfaction

Summary:

There is a tri-modal distribution for employees that turnovered
Employees who had really low satisfaction levels (0.2 or less) left the company more
Employees who had low satisfaction levels (0.3~0.5) left the company more
Employees who had really high satisfaction levels (0.7 or more) left the company more

#KDEPlot: Kernel Density Estimate Plot
ggplot() + geom_density(aes(x=satisfaction_level), colour="red", data=left_data) + 
  geom_density(aes(x=satisfaction_level), colour="blue", data=stay_data)

3j. ProjectCount VS AverageMonthlyHours

Summary:

As project count increased, so did average monthly hours
Something weird about the boxplot graph is the difference in averageMonthlyHours between people who had a turnver and did not.
Looks like employees who did not have a turnover had consistent averageMonthlyHours, despite the increase in projects
In contrast, employees who did have a turnover had an increase in averageMonthlyHours with the increase in projects

Stop and Think:

What could be the meaning for this?
Why is it that employees who left worked more hours than employees who didn’t, even with the same project count?

#ProjectCount VS AverageMonthlyHours [BOXPLOT]
#Looks like the average employees who stayed worked about 200hours/month. Those that had a turnover worked about 250hours/month and 150hours/month

library(ggplot2)
p<-ggplot(data_set, aes(x = factor(number_project), y = average_montly_hours, fill = factor(left))) +
  geom_boxplot() + scale_fill_manual(values = c("yellow", "orange"))
print(p)

3k. ProjectCount VS Evaluation

Summary: This graph looks very similar to the graph above. What I find strange with this graph is with the turnover group. There is an increase in evaluation for employees who did more projects within the turnover group. But, again for the non-turnover group, employees here had a consistent evaluation score despite the increase in project counts.

*Questions to think about:*

Why is it that employees who left, had on average, a higher evaluation than employees who did not leave, even with an increase in project count?
Shouldn’t employees with lower evaluations tend to leave the company more?

#ProjectCount VS Evaluation
#Looks like employees who did not leave the company had an average evaluation of around 70% even with different projectCounts
#There is a huge skew in employees who had a turnover though. It drastically changes after 3 projectCounts. 
#Employees that had two projects and a horrible evaluation left. Employees with more than 3 projects and super high evaluations left
p<-ggplot(data_set, aes(x = factor(number_project), y = last_evaluation, fill = factor(left))) +
  geom_boxplot() + scale_fill_manual(values = c("yellow", "orange"))
print(p)

3l. Satisfaction VS Evaluation

Summary: This is by far the most compelling graph. This is what I found:

*There are 3 distinct clusters for employees who left the company

Cluster 1 (Hard-working and Sad Employee): Satisfaction was below 0.2 and evaluations were greater than 0.75. Which could be a good indication that employees who left the company were good workers but felt horrible at their job.

Question: What could be the reason for feeling so horrible when you are highly evaluated? Could it be working too hard? Could this cluster mean employees who are “overworked”?

Cluster 2 (Bad and Sad Employee): Satisfaction between about 0.35~0.45 and evaluations below ~0.58. This could be seen as employees who were badly evaluated and felt bad at work.

Question: Could this cluster mean employees who “under-performed”?

Cluster 3 (Hard-working and Happy Employee): Satisfaction between 0.7~1.0 and evaluations were greater than 0.8. Which could mean that employees in this cluster were “ideal”. They loved their work and were evaluated highly for their performance.

Question: Could this cluser mean that employees left because they found another job opportunity?

library(ggplot2)
ggplot(data_set, aes(satisfaction_level, last_evaluation, color = left)) +
  geom_point(shape = 16, size = 5, show.legend = FALSE) +
  theme_minimal() +
  scale_color_gradient(low = "#0091ff", high = "#f0650e")

3m. Feature Importance

BORUTA

Boruta is a feature selection algorithm. Precisely, it works as a wrapper algorithm around Random Forest. This package derive its name from a demon in Slavic mythology who dwelled in pine forests. Feature selection is a crucial step in predictive modeling. This technique achieves supreme importance when a data set comprised of several variables is given for model building.

Boruta can be your algorithm of choice to deal with such data sets. Particularly when one is interested in understanding the mechanisms related to the variable of interest, rather than just building a black box predictive model with good prediction accuracy.

How does it work?

Below is the step wise working of boruta algorithm:

Firstly, it adds randomness to the given data set by creating shuffled copies of all features (which are called shadow features). Then, it trains a random forest classifier on the extended data set and applies a feature importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each feature where higher means more important.

At every iteration, it checks whether a real feature has a higher importance than the best of its shadow features (i.e. whether the feature has a higher Z score than the maximum Z score of its shadow features) and constantly removes features which are deemed highly unimportant.

Finally, the algorithm stops either when all features gets confirmed or rejected or it reaches a specified limit of random forest runs.

library(Boruta)

## Loading required package: ranger

data_set$left<-as.factor(data_set$left)
boruta.train <- Boruta(left~., data = data_set, doTrace = 2)

##  1. run of importance source...

##  2. run of importance source...

##  3. run of importance source...

##  4. run of importance source...

##  5. run of importance source...

##  6. run of importance source...

##  7. run of importance source...

##  8. run of importance source...

##  9. run of importance source...

##  10. run of importance source...

## After 10 iterations, +1.9 mins:

##  confirmed 9 attributes: average_montly_hours, exp_in_company, last_evaluation, number_project, promotion_last_5years and 4 more;

##  no more attributes left.

print(boruta.train)

## Boruta performed 10 iterations in 1.940593 mins.
##  9 attributes confirmed important: average_montly_hours,
## exp_in_company, last_evaluation, number_project,
## promotion_last_5years and 4 more;
##  No attributes deemed unimportant.

plot(boruta.train, xlab = "", xaxt = "n")

lz<-lapply(1:ncol(boruta.train$ImpHistory),function(i)
boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]),i])
names(lz) <- colnames(boruta.train$ImpHistory)
Labels <- sort(sapply(lz,median))
axis(side = 1,las=2,labels = names(Labels),
at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7)

#library(Boruta)
#train$left<-as.factor(train$left)
#boruta.train <- Boruta(left~., data = train, doTrace = 2)

#print(boruta.train)

#plot(boruta.train, xlab = "", xaxt = "n")

#lz<-lapply(1:ncol(boruta.train$ImpHistory),function(i)
#boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]),i])
#names(lz) <- colnames(boruta.train$ImpHistory)
#Labels <- sort(sapply(lz,median))
#axis(side = 1,las=2,labels = names(Labels),
#at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7)

Key Observations: The above graph clearly represents the factors which serve as the top reasons for attrition in a company:

Satisfaction level: it already had a negative corellation with the outcome. People with low satisfaction were most likely to leave even when compared with evaluations(Evident cluster was formed with respect to low satisfaction)
Salary and the role they played has one of the least impact on attrition
Pressure due to the number of projects and how they were evaluated also holds key significance in determining attrition

4a. Modeling the Data: Logistic Regression Analysis

Logistic Regression commonly deals with the issue of how likely an observation is to belong to each group. This model is commonly used to predict the likelihood of an event occurring. In contrast to linear regression, the output of logistic regression is transformed with a logit function. This makes the output either 0 or 1. This is a useful model to take advantage of for this problem because we are interested in predicting whether an employee will leave (0) or stay (1).

Another reason for why logistic regression is the preferred model of choice is because of its interpretability. Logistic regression predicts the outcome of the response variable (turnover) through a set of other explanatory variables, also called predictors. In context of this domain, the value of our response variable is categorized into two forms: 0 (zero) or 1 (one). The value of 0 (zero) represents the probability of an employee not leaving the company and the value of 1 (one) represents the probability of an employee leaving the company.

The equation above shows the relationship between, the dependent variable (success), denoted as (Î¸) and independent variables or predictor of event, denoted as xi. Where Î± is the constant of the equation and, Î² is the coefficient of the predictor variables

#Creating training and test sets for the logistic regression
smp_size <- floor(0.75 * nrow(data_set))

## set the seed to make your partition reproductible
set.seed(123)
train_ind <- sample(seq_len(nrow(data_set)), size = smp_size)

train <- data_set[train_ind, ]
test <- data_set[-train_ind, ]

dim(test)

## [1] 3750   10

dim(train)

## [1] 11249    10

library(gmodels)
library (Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:plyr':
## 
##     is.discrete, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library (caTools)
library (ROCR)

## Loading required package: gplots

## 
## Attaching package: 'gplots'

## The following object is masked from 'package:stats':
## 
##     lowess

logit_model<-glm(left~satisfaction_level+last_evaluation+average_montly_hours+salary+role+number_project,data=train,binomial())

summary(logit_model)

## 
## Call:
## glm(formula = left ~ satisfaction_level + last_evaluation + average_montly_hours + 
##     salary + role + number_project, family = binomial(), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8701  -0.6965  -0.4571  -0.1801   2.8414  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -0.7889072  0.2090275  -3.774 0.000161 ***
## satisfaction_level   -4.1663664  0.1108697 -37.579  < 2e-16 ***
## last_evaluation       0.8050924  0.1653468   4.869 1.12e-06 ***
## average_montly_hours  0.0048773  0.0005781   8.437  < 2e-16 ***
## salarylow             1.7702386  0.1401465  12.631  < 2e-16 ***
## salarymedium          1.2773270  0.1414040   9.033  < 2e-16 ***
## rolehr                0.0576180  0.1460020   0.395 0.693110    
## roleIT               -0.2350453  0.1353532  -1.737 0.082470 .  
## rolemanagement       -0.3395424  0.1745319  -1.945 0.051721 .  
## rolemarketing        -0.0444521  0.1451552  -0.306 0.759423    
## roleproduct_mng      -0.2174759  0.1456506  -1.493 0.135402    
## roleRandD            -0.6158662  0.1599300  -3.851 0.000118 ***
## rolesales            -0.1107893  0.1127533  -0.983 0.325814    
## rolesupport          -0.0168767  0.1202623  -0.140 0.888398    
## roletechnical        -0.0038007  0.1173905  -0.032 0.974172    
## number_project       -0.2601256  0.0236473 -11.000  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12345  on 11248  degrees of freedom
## Residual deviance: 10184  on 11233  degrees of freedom
## AIC: 10216
## 
## Number of Fisher Scoring iterations: 5

test$logit_model<-predict(logit_model,test)
#head(test)

colAUC(test$logit_model,test$left, plotROC=TRUE)

##              [,1]
## 0 vs. 1 0.7837071

#An approach to identify the cut-off for the predicted probabilities 
#is to use a binned table of the proababilities. See the exact threshold 
#where 0's and 1's are getting classified with high precision and recall
#you can use the two commented lines below to get the threshold manually
#test$logit_model_bin <- cut2(test$logit_model,g=12)

#CrossTable(test$left, test$logit_model_bin,prop.chisq=FALSE,prop.r=FALSE,prop.t=FALSE)

#Now using that threshold created the predicted values for each record
test$prediction<-ifelse(test$logit_model>=-.95,1,0)

#Make use of the confusion matrix to calculate accuracy, precision and recall
#CrossTable(test$left, test$prediction,prop.chisq=FALSE,prop.r=FALSE,prop.t=FALSE)
conf_mat<-table(test$left,test$prediction)

#print(conf_mat)
#class(conf_mat)
accuracy<-(conf_mat[1,1]+conf_mat[2,2])/(conf_mat[1,1]+conf_mat[2,2]+conf_mat[1,2]+conf_mat[2,1])
recall<-(conf_mat[2,2])/(conf_mat[1,2]+conf_mat[2,2])
precision<-(conf_mat[2,2])/(conf_mat[2,2]+conf_mat[2,1])

print(c("Accuracy:",accuracy))

## [1] "Accuracy:" "0.7616"

print(c("Precision:",precision))

## [1] "Precision:"        "0.695749440715884"

print(c("Recall:",recall))

## [1] "Recall:" "0.5"

#red <- prediction(test$prediction, test$left)
#P.perf <- performance(pred, "prec", "rec")
#lot (RP.perf)

4b. Modeling the Data

The best model performance out of the four (Decision Tree Model, AdaBoost Model, Logistic Regression Model, Random Forest Model) was Random Forest!

Note: Base Rate

A Base Rate Model is a model that always selects the target variable’s majority class. It’s just used for reference to compare how better another model is against it. In this dataset, the majority class that will be predicted will be 0’s, which are employees who did not leave the company.
If you recall back to Part 3: Exploring the Data, 24% of the dataset contained 1’s (employee who left the company) and the remaining 76% contained 0’s (employee who did not leave the company). The Base Rate Model would simply predict every 0’s and ignore all the 1’s.
Example: The base rate accuracy for this data set, when classifying everything as 0’s, would be 76% because 76% of the dataset are labeled as 0’s (employees not leaving the company).

Note: Different Ways to Evaluate Classification Models

Predictive Accuracy: How many does it get right?
Speed: How fast does it take for the model to deploy?
Scalability: Can the model handle large datasets?
Robustness: How well does the model handle outliers/missing values?
Interpretability: Is the model easy to understand?

Caption for the picture.

10 Fold Cross Validation for Logistic Regression**Â¶

# load the library
library(caret)

## 
## Attaching package: 'caret'

## The following object is masked from 'package:survival':
## 
##     cluster

# load the iris dataset

# define training control
train_control <- trainControl(method="cv", number=10)
train$left<-as.factor(train$left)
# fix the parameters of the algorithm
grid <- expand.grid()
# train the model
model <- train(left~., data=train, trControl=train_control, method="glm",family=binomial())
#model <- train(left~satisfaction_level+last_evaluation+number_project+exp_in_company+average_montly_hours, data=train, trControl=train_control, method="glm",family=binomial())
# summarize results
print(model)

## Generalized Linear Model 
## 
## 11249 samples
##     9 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 10125, 10124, 10124, 10124, 10124, 10125, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7884257  0.3188393

pred_log<-predict(model, newdata = test, type = "prob")
pred_log_c<-predict(model, newdata = test, type = "raw")

test_log_model<-pred_log[,2]
test_log_model_1<-data.frame(test_log_model)


colAUC(test_log_model_1$test_log_model,test$left, plotROC=TRUE)

##              [,1]
## 0 vs. 1 0.8202086

conf_mat<-table(as.factor(pred_log_c),as.factor(test$left))

print(conf_mat)

##    
##        0    1
##   0 2665  574
##   1  191  320

#class(conf_mat)
accuracy<-(conf_mat[1,1]+conf_mat[2,2])/(conf_mat[1,1]+conf_mat[2,2]+conf_mat[1,2]+conf_mat[2,1])
recall<-(conf_mat[2,2])/(conf_mat[1,2]+conf_mat[2,2])
precision<-(conf_mat[2,2])/(conf_mat[2,2]+conf_mat[2,1])

print(c("Accuracy:",accuracy))

## [1] "Accuracy:" "0.796"

print(c("Precision:",precision))

## [1] "Precision:"        "0.626223091976517"

print(c("Recall:",recall))

## [1] "Recall:"           "0.357941834451902"

Logistic Regression V.S. Random Forest V.S. Decision Tree V.S. AdaBoost Model

# NOTE: By adding in "class_weight = balanced", the Logistic Auc increased by about 10%! This adjusts the threshold value


# Decision Tree Model
library(rpart.plot)

## Loading required package: rpart

## 
## Attaching package: 'rpart'

## The following object is masked from 'package:survival':
## 
##     solder

trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
train$left<-as.factor(train$left)
dtree_fit <- train(left ~., data = train, method = "rpart",
                   parms = list(split = "information"),
                   trControl=trctrl,
                   tuneLength = 10)

#plot decision tree
p<-prp(dtree_fit$finalModel, box.palette = "Reds", tweak = 1.2)

print(p)

## $obj
## n= 11249 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 11249 2677 0 (0.76202329 0.23797671)  
##    2) satisfaction_level>=0.465 8118  785 0 (0.90330131 0.09669869)  
##      4) exp_in_company< 4.5 6629   99 0 (0.98506562 0.01493438) *
##      5) exp_in_company>=4.5 1489  686 0 (0.53928811 0.46071189)  
##       10) last_evaluation< 0.805 574   24 0 (0.95818815 0.04181185) *
##       11) last_evaluation>=0.805 915  253 1 (0.27650273 0.72349727)  
##         22) exp_in_company>=6.5 126    0 0 (1.00000000 0.00000000) *
##         23) exp_in_company< 6.5 789  127 1 (0.16096324 0.83903676)  
##           46) average_montly_hours< 214 85   14 0 (0.83529412 0.16470588) *
##           47) average_montly_hours>=214 704   56 1 (0.07954545 0.92045455)  
##             94) satisfaction_level< 0.705 35    7 0 (0.80000000 0.20000000) *
##             95) satisfaction_level>=0.705 669   28 1 (0.04185351 0.95814649) *
##    3) satisfaction_level< 0.465 3131 1239 1 (0.39572022 0.60427978)  
##      6) number_project>=2.5 1847  755 0 (0.59122902 0.40877098)  
##       12) satisfaction_level>=0.115 1183   91 0 (0.92307692 0.07692308) *
##       13) satisfaction_level< 0.115 664    0 1 (0.00000000 1.00000000) *
##      7) number_project< 2.5 1284  147 1 (0.11448598 0.88551402)  
##       14) last_evaluation>=0.575 87    5 0 (0.94252874 0.05747126) *
##       15) last_evaluation< 0.575 1197   65 1 (0.05430242 0.94569758)  
##         30) average_montly_hours>=162 36    6 0 (0.83333333 0.16666667) *
##         31) average_montly_hours< 162 1161   35 1 (0.03014643 0.96985357)  
##           62) average_montly_hours< 125.5 17    0 0 (1.00000000 0.00000000) *
##           63) average_montly_hours>=125.5 1144   18 1 (0.01573427 0.98426573) *
## 
## $snipped.nodes
## NULL
## 
## $xlim
## [1] 0 1
## 
## $ylim
## [1] 0 1
## 
## $x
##  [1] 0.36494386 0.10121162 0.02625852 0.17616472 0.10121837 0.25111108
##  [7] 0.17623218 0.32598997 0.25178566 0.40019428 0.33273582 0.46765275
## [13] 0.62867610 0.46846225 0.40100378 0.53592071 0.74571654 0.67083764
## [19] 0.82059543 0.74639112 0.89479974 0.82734128 0.96225821
## 
## $y
##  [1] 0.95182258 0.79547965 0.63913671 0.63913671 0.48279378 0.48279378
##  [7] 0.32645084 0.32645084 0.17010791 0.17010791 0.01376498 0.01376498
## [13] 0.79547965 0.63913671 0.48279378 0.48279378 0.63913671 0.48279378
## [19] 0.48279378 0.32645084 0.32645084 0.17010791 0.17010791
## 
## $branch.x
##        [,1]      [,2]       [,3]      [,4]      [,5]      [,6]      [,7]
## x 0.3649439 0.1012116 0.02625852 0.1761647 0.1012184 0.2511111 0.1762322
##          NA 0.3121974 0.08622100 0.1162022 0.1611755 0.1911540 0.2361353
##          NA 0.3649439 0.10121162 0.1012116 0.1761647 0.1761647 0.2511111
##        [,8]      [,9]     [,10]     [,11]     [,12]     [,13]     [,14]
## x 0.3259900 0.2517857 0.4001943 0.3327358 0.4676527 0.6286761 0.4684622
##   0.2660869 0.3111491 0.3408308 0.3867026 0.4136860 0.4176903 0.5793640
##   0.2511111 0.3259900 0.3259900 0.4001943 0.4001943 0.3649439 0.6286761
##       [,15]     [,16]     [,17]     [,18]     [,19]     [,20]     [,21]
## x 0.4010038 0.5359207 0.7457165 0.6708376 0.8205954 0.7463911 0.8947997
##   0.4549706 0.4819539 0.6348148 0.7307408 0.7606923 0.8057546 0.8354363
##   0.4684622 0.4684622 0.6286761 0.7457165 0.7457165 0.8205954 0.8205954
##       [,22]     [,23]
## x 0.8273413 0.9622582
##   0.8813081 0.9082914
##   0.8947997 0.8947997
## 
## $branch.y
##        [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
## y 0.9575063 0.8011633 0.6904954 0.6448204 0.5341525 0.4884775 0.3778095
##          NA 0.9518226 0.7954796 0.7954796 0.6391367 0.6391367 0.4827938
##          NA 0.9518226 0.7954796 0.7954796 0.6391367 0.6391367 0.4827938
##        [,8]      [,9]     [,10]      [,11]      [,12]     [,13]     [,14]
## y 0.3321345 0.2214666 0.1757916 0.06512368 0.06512368 0.8011633 0.6448204
##   0.4827938 0.3264508 0.3264508 0.17010791 0.17010791 0.9518226 0.7954796
##   0.4827938 0.3264508 0.3264508 0.17010791 0.17010791 0.9518226 0.7954796
##       [,15]     [,16]     [,17]     [,18]     [,19]     [,20]     [,21]
## y 0.5341525 0.5341525 0.6448204 0.5341525 0.4884775 0.3778095 0.3321345
##   0.6391367 0.6391367 0.7954796 0.6391367 0.6391367 0.4827938 0.4827938
##   0.6391367 0.6391367 0.7954796 0.6391367 0.6391367 0.4827938 0.4827938
##       [,22]     [,23]
## y 0.2214666 0.2214666
##   0.3264508 0.3264508
##   0.3264508 0.3264508
## 
## $labs
##  [1] NA  NA  "0" NA  "0" NA  "0" NA  "0" NA  "0" "1" NA  NA  "0" "1" NA 
## [18] "0" NA  "0" NA  "0" "1"
## 
## $cex
## [1] 1.2
## 
## $boxes
## $boxes$x1
##  [1]          NA          NA 0.004099429          NA 0.079059275
##  [6]          NA 0.154073087          NA 0.229626567          NA
## [11] 0.310576725 0.445493654          NA          NA 0.378844691
## [16] 0.513761620          NA 0.648678548          NA 0.724232029
## [21]          NA 0.805182186 0.940099115
## 
## $boxes$y1
##  [1]          NA          NA  0.62201715          NA  0.46567421
##  [6]          NA  0.30933128          NA  0.15298834          NA
## [11] -0.00335459 -0.00335459          NA          NA  0.46567421
## [16]  0.46567421          NA  0.46567421          NA  0.30933128
## [21]          NA  0.15298834  0.15298834
## 
## $boxes$x2
##  [1]         NA         NA 0.04841761         NA 0.12337746         NA
##  [7] 0.19839127         NA 0.27394475         NA 0.35489491 0.48981184
## [13]         NA         NA 0.42316288 0.55807980         NA 0.69299673
## [19]         NA 0.76855021         NA 0.84950037 0.98441730
## 
## $boxes$y2
##  [1]         NA         NA 0.69049541         NA 0.53415248         NA
##  [7] 0.37780955         NA 0.22146661         NA 0.06512368 0.06512368
## [13]         NA         NA 0.53415248 0.53415248         NA 0.53415248
## [19]         NA 0.37780955         NA 0.22146661 0.22146661
## 
## 
## $split.labs
## [1] ""
## 
## $split.cex
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## $split.box
## $split.box$x1
##  [1]  0.23071090 -0.02322021          NA  0.04789767          NA
##  [6]  0.11730425          NA  0.19644451          NA  0.28428518
## [11]          NA          NA  0.48549427  0.33422928          NA
## [16]          NA  0.60807448          NA  0.68167497          NA
## [21]  0.76525428          NA          NA
## 
## $split.box$y1
##  [1] 0.9347030 0.7783601        NA 0.6220171        NA 0.4656742        NA
##  [8] 0.3093313        NA 0.1529883        NA        NA 0.7783601 0.6220171
## [15]        NA        NA 0.6220171        NA 0.4656742        NA 0.3093313
## [22]        NA        NA
## 
## $split.box$x2
##  [1] 0.4991768 0.2256434        NA 0.3044318        NA 0.3849179        NA
##  [8] 0.4555354        NA 0.5161034        NA        NA 0.7718579 0.6026952
## [15]        NA        NA 0.8833586        NA 0.9595159        NA 1.0243452
## [22]        NA        NA
## 
## $split.box$y2
##  [1] 1.0031813 0.8468383        NA 0.6904954        NA 0.5341525        NA
##  [8] 0.3778095        NA 0.2214666        NA        NA 0.8468383 0.6904954
## [15]        NA        NA 0.6904954        NA 0.5341525        NA 0.3778095
## [22]        NA        NA

pred_d3<-predict(dtree_fit, newdata = test, type = "prob")
pred_d3_c<-predict(dtree_fit, newdata = test, type = "raw")




test_rpart_model<-pred_d3[,2]
test_rpart_model_1<-data.frame(test_rpart_model)
str(test_rpart_model_1)

## 'data.frame':    3750 obs. of  1 variable:
##  $ test_rpart_model: num  0.958 1 0.958 0.984 1 ...

colAUC(test_rpart_model_1$test_rpart_model,test$left, plotROC=TRUE)

##              [,1]
## 0 vs. 1 0.9742972

conf_mat<-table(as.factor(pred_d3_c),as.factor(test$left))

print(conf_mat)

##    
##        0    1
##   0 2831   69
##   1   25  825

#class(conf_mat)
accuracy<-(conf_mat[1,1]+conf_mat[2,2])/(conf_mat[1,1]+conf_mat[2,2]+conf_mat[1,2]+conf_mat[2,1])
recall<-(conf_mat[2,2])/(conf_mat[1,2]+conf_mat[2,2])
precision<-(conf_mat[2,2])/(conf_mat[2,2]+conf_mat[2,1])

print(c("Accuracy:",accuracy))

## [1] "Accuracy:"         "0.974933333333333"

print(c("Precision:",precision))

## [1] "Precision:"        "0.970588235294118"

print(c("Recall:",recall))

## [1] "Recall:"           "0.922818791946309"

# Random Forest Model
library(rpart.plot)
library(caret)
train$left<-as.factor(train$left)

#print("uncomment code and run. It generally kills the kernel because of the time taken by Random forest")
#ctrl = trainControl(method="repeatedcv", number=10, repeats=5, selectionFunction ="oneSE")


#rf_model<-train(left~.,data=train,method="rf",trControl=ctrl,prox=TRUE,allowParallel=TRUE)
library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ranger':
## 
##     importance

## The following object is masked from 'package:ggplot2':
## 
##     margin

train$left<-as.factor(train$left)
rf_model <- randomForest(left~.,data=train)
print("random forest")

## [1] "random forest"

# View the forest results.
print(rf_model)

## 
## Call:
##  randomForest(formula = left ~ ., data = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 0.86%
## Confusion matrix:
##      0    1 class.error
## 0 8561   11 0.001283248
## 1   86 2591 0.032125514

# Importance of each predictor.
print(importance(rf_model,type = 2))

##                       MeanDecreaseGini
## satisfaction_level         1344.528758
## last_evaluation             488.942702
## number_project              745.018203
## average_montly_hours        586.455485
## exp_in_company              768.389375
## Work_accident                22.615682
## promotion_last_5years         4.144949
## role                         64.040424
## salary                       32.328103

varImpPlot(rf_model)

#Prediction(test)
test$rf_model<-predict(rf_model,test, type="prob")
test$rf_model_c<-predict(rf_model,test, type="response")

test_rf_model<-test$rf_model[,2]
test_rf_model_1<-data.frame(test_rf_model)

colAUC(test_rf_model_1$test_rf_model,test$left, plotROC=TRUE)

##              [,1]
## 0 vs. 1 0.9964115

conf_mat<-table(as.factor(test$rf_model_c),as.factor(test$left))



#print(conf_mat)
#class(conf_mat)
accuracy<-(conf_mat[1,1]+conf_mat[2,2])/(conf_mat[1,1]+conf_mat[2,2]+conf_mat[1,2]+conf_mat[2,1])
recall<-(conf_mat[2,2])/(conf_mat[1,2]+conf_mat[2,2])
precision<-(conf_mat[2,2])/(conf_mat[2,2]+conf_mat[2,1])

print(c("Accuracy:",accuracy))

## [1] "Accuracy:"         "0.992533333333333"

print(c("Precision:",precision))

## [1] "Precision:"        "0.992045454545455"

print(c("Recall:",recall))

## [1] "Recall:"           "0.976510067114094"

library(fastAdaboost)

## Warning: package 'fastAdaboost' was built under R version 3.5.3

mod_adaboost <- adaboost(left~.,data=train, 10)
pred <- predict(mod_adaboost,newdata=test)
print(table(factor(pred$class),factor(test$left)))

##    
##        0    1
##   0 2830   20
##   1   26  874

colAUC(pred$prob[,2],factor(test$left), plotROC=TRUE)

##              [,1]
## 0 vs. 1 0.9937249

conf_mat<-table(factor(pred$class),factor(test$left))
accuracy<-(conf_mat[1,1]+conf_mat[2,2])/(conf_mat[1,1]+conf_mat[2,2]+conf_mat[1,2]+conf_mat[2,1])
recall<-(conf_mat[2,2])/(conf_mat[1,2]+conf_mat[2,2])
precision<-(conf_mat[2,2])/(conf_mat[2,2]+conf_mat[2,1])

print(c("Accuracy:",accuracy))

## [1] "Accuracy:"         "0.987733333333333"

print(c("Precision:",precision))

## [1] "Precision:"        "0.971111111111111"

print(c("Recall:",recall))

## [1] "Recall:"           "0.977628635346756"

5. Interpreting the Data

Summary: With all of this information, this is what Bob should know about his company and why his employees probably left:

Employees generally left when they are underworked (less than 150hr/month or 6hr/day)
Employees generally left when they are overworked (more than 250hr/month or 10hr/day)
Employees with either really high or low evaluations should be taken into consideration for high turnover rate
Employees with low to medium salaries are the bulk of employee turnover
Employees that had 2,6, or 7 project count was at risk of leaving the company
Employee satisfaction is the highest indicator for employee turnover.
Employee that had 4 and 5 yearsAtCompany should be taken into consideration for high turnover rate

library(DALEX)

## Warning: package 'DALEX' was built under R version 3.5.3

## Welcome to DALEX (version: 0.3.0).
## This is a plain DALEX. Use 'install_dependencies()' to get all required packages.

library(breakDown)
# Create a DALEX explainer
explainer_rf <- explain(rf_model,data=train)
colnames(test)

##  [1] "satisfaction_level"    "last_evaluation"      
##  [3] "number_project"        "average_montly_hours" 
##  [5] "exp_in_company"        "Work_accident"        
##  [7] "left"                  "promotion_last_5years"
##  [9] "role"                  "salary"               
## [11] "logit_model"           "prediction"           
## [13] "rf_model"              "rf_model_c"

# Calculate variable attributions
new_observation <- test[3,-7]
library("breakDown")
bd_rf <- prediction_breakdown(explainer_rf,new_observation)
 
bd_rf

##                                           variable  contribution
## 1                                      (Intercept)  2.397305e-01
## satisfaction_level     + satisfaction_level = 0.89 -1.360821e-01
## average_montly_hours  + average_montly_hours = 224 -9.141968e-03
## role                                + role = sales -5.387679e-03
## promotion_last_5years  + promotion_last_5years = 0  7.858476e-05
## Work_accident                  + Work_accident = 0  8.836341e-05
## salary                              + salary = low  7.407947e-03
## number_project                + number_project = 5  1.538590e-02
## exp_in_company                + exp_in_company = 5  3.086594e-01
## last_evaluation              + last_evaluation = 1  5.792611e-01
## 11                                 final_prognosis  1.000000e+00
##                               variable_name variable_value cummulative
## 1                                 Intercept              1  0.23973046
## satisfaction_level       satisfaction_level           0.89  0.10364832
## average_montly_hours   average_montly_hours            224  0.09450636
## role                                   role          sales  0.08911868
## promotion_last_5years promotion_last_5years              0  0.08919726
## Work_accident                 Work_accident              0  0.08928563
## salary                               salary            low  0.09669357
## number_project               number_project              5  0.11207947
## exp_in_company               exp_in_company              5  0.42073891
## last_evaluation             last_evaluation              1  1.00000000
## 11                                                          1.00000000
##                       sign position        label
## 1                        1        1 randomForest
## satisfaction_level      -1        2 randomForest
## average_montly_hours    -1        3 randomForest
## role                    -1        4 randomForest
## promotion_last_5years    1        5 randomForest
## Work_accident            1        6 randomForest
## salary                   1        7 randomForest
## number_project           1        8 randomForest
## exp_in_company           1        9 randomForest
## last_evaluation          1       10 randomForest
## 11                       X       11 randomForest

plot(bd_rf)

People Analytics - A guide to using R for breaking down employee data and understanding Attrition

Heru Wiryanto

April 25, 2019

Preview

Part 1: Obtaining the Data

Part 2: Data Cleaning

Renaming the irrelevant Variable name

Part 3: Exploring the Data

3a. Statistical Overview

3b. Correlation Matrix & Heatmap

3b2. Statistical Test for Correlation

Conducting the T-Test

T-Test Result

T-Test Quantile

One-Sample T-Test Summary

3c. Distribution Plots (Satisfaction - Evaluation - AverageMonthlyHours)

3d. Salary V.S. Turnover

3e. Department V.S. Turnover

3f. Turnover V.S. ProjectCount

3g. Turnover V.S. Evaluation

3h. Turnover V.S. AverageMonthlyHours

3i. Turnover V.S. Satisfaction

3j. ProjectCount VS AverageMonthlyHours

3k. ProjectCount VS Evaluation

3l. Satisfaction VS Evaluation

3m. Feature Importance

BORUTA

4a. Modeling the Data: Logistic Regression Analysis

4b. Modeling the Data

10 Fold Cross Validation for Logistic Regression**Â¶

Logistic Regression V.S. Random Forest V.S. Decision Tree V.S. AdaBoost Model

5. Interpreting the Data