We providing the equivalent R-code for the same exercise with some additional empahsis on a few extra steps like -
The code follows the standard ML pipeline of Preprocessing, exploration, modelling, iterating and insights. Do feel free to raise your questions.. There is a lot of emphasis on the visualization aspect to understand the data.
#Read the analytics csv file and store our dataset into a dataframe called "df"
library(readr)
## Warning: package 'readr' was built under R version 3.5.2
data_set<-read.csv('D:/Back up May 2018/Documents/Rattle-data/HR_comma_sep1.csv', sep=";")
Always better to run some basic checks and see if there are missing values or unusual MAX Values in every variable. The data aint gonna be friendly.
# Check to see if there are any missing values in our data and checking overall summary
summary(data_set)
## satisfaction_level last_evaluation number_project average_montly_hours
## Min. :0.0900 Min. :0.3600 Min. :2.000 Min. : 96.0
## 1st Qu.:0.4400 1st Qu.:0.5600 1st Qu.:3.000 1st Qu.:156.0
## Median :0.6400 Median :0.7200 Median :4.000 Median :200.0
## Mean :0.6128 Mean :0.7161 Mean :3.803 Mean :201.1
## 3rd Qu.:0.8200 3rd Qu.:0.8700 3rd Qu.:5.000 3rd Qu.:245.0
## Max. :1.0000 Max. :1.0000 Max. :7.000 Max. :310.0
##
## time_spend_company Work_accident left promotion_last_5years
## Min. : 2.000 Min. :0.0000 NO :11428 Min. :0.00000
## 1st Qu.: 3.000 1st Qu.:0.0000 YES: 3571 1st Qu.:0.00000
## Median : 3.000 Median :0.0000 Median :0.00000
## Mean : 3.498 Mean :0.1446 Mean :0.02127
## 3rd Qu.: 4.000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :10.000 Max. :1.0000 Max. :1.00000
##
## sales salary
## sales :4140 high :1237
## technical :2720 low :7316
## support :2229 medium:6446
## IT :1227
## product_mng: 902
## marketing : 858
## (Other) :2923
The only variable which doesnt represent what it stands is the “Sales” variable which clearly represents Job Role
#Data overview Just so that the summary is not good enough to get an idea of the data into your head
#Renaming dataset
library(plyr)
library(knitr)
## Warning: package 'knitr' was built under R version 3.5.2
data_set<-rename(data_set, c("sales"="role"))
data_set<-rename(data_set, c("time_spend_company"="exp_in_company"))
names(data_set)[10]<-"salary"
kable(head(data_set))
satisfaction_level | last_evaluation | number_project | average_montly_hours | exp_in_company | Work_accident | left | promotion_last_5years | role | salary |
---|---|---|---|---|---|---|---|---|---|
0.38 | 0.53 | 2 | 157 | 3 | 0 | YES | 0 | sales | low |
0.80 | 0.86 | 5 | 262 | 6 | 0 | YES | 0 | sales | medium |
0.11 | 0.88 | 7 | 272 | 4 | 0 | YES | 0 | sales | medium |
0.72 | 0.87 | 5 | 223 | 5 | 0 | YES | 0 | sales | low |
0.37 | 0.52 | 2 | 159 | 3 | 0 | YES | 0 | sales | low |
0.41 | 0.50 | 2 | 153 | 3 | 0 | YES | 0 | sales | low |
The dataset has:
# The dataset contains 10 columns and 14999 observations
dim(data_set)
## [1] 14999 10
# Check the type of our features.
data_set$left<-as.numeric(data_set$left)
data_set$left<-data_set$left-1
str(data_set)
## 'data.frame': 14999 obs. of 10 variables:
## $ satisfaction_level : num 0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
## $ last_evaluation : num 0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
## $ number_project : int 2 5 7 5 2 2 6 5 5 2 ...
## $ average_montly_hours : int 157 262 272 223 159 153 247 259 224 142 ...
## $ exp_in_company : int 3 6 4 5 3 3 4 5 5 3 ...
## $ Work_accident : int 0 0 0 0 0 0 0 0 0 0 ...
## $ left : num 1 1 1 1 1 1 1 1 1 1 ...
## $ promotion_last_5years: int 0 0 0 0 0 0 0 0 0 0 ...
## $ role : Factor w/ 10 levels "accounting","hr",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ salary : Factor w/ 3 levels "high","low","medium": 2 3 3 2 2 2 2 2 2 2 ...
# Looks like about 76% of employees stayed and 24% of employees left.
# NOTE: When performing cross validation, its important to maintain this turnover ratio
attrition<-as.factor(data_set$left)
summary(attrition)
## 0 1
## 11428 3571
perc_attrition_rate<-sum(data_set$left/length(data_set$left))*100
#percentage of attrition
print(perc_attrition_rate)
## [1] 23.80825
# Overview of summary (Turnover V.S. Non-turnover)
cor_vars<-data_set[,c("satisfaction_level","last_evaluation","number_project","average_montly_hours","exp_in_company","Work_accident","left","promotion_last_5years")]
aggregate(cor_vars[,c("satisfaction_level","last_evaluation","number_project","average_montly_hours","exp_in_company","Work_accident","promotion_last_5years")], by=list(Category=cor_vars$left), FUN=mean)
## Category satisfaction_level last_evaluation number_project
## 1 0 0.6668096 0.7154734 3.786664
## 2 1 0.4400980 0.7181126 3.855503
## average_montly_hours exp_in_company Work_accident promotion_last_5years
## 1 199.0602 3.380032 0.17500875 0.026251313
## 2 207.4192 3.876505 0.04732568 0.005320638
Moderate Positively Correlated Features:
Moderate Negatively Correlated Feature:
Stop and Think:
Summary:
From the heatmap, there is a positive(+) correlation between projectCount, averageMonthlyHours, and evaluation. Which could mean that the employees who spent more hours and did more projects were evaluated highly.
For the negative(-) relationships, turnover and satisfaction are highly correlated. I’m assuming that people tend to leave a company more when they are less satisfied.
#Correlation Matrix
library(reshape2)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.3
cor_vars<-data_set[,c("satisfaction_level","last_evaluation","number_project","average_montly_hours","exp_in_company","Work_accident","left","promotion_last_5years")]
cor(cor_vars)
## satisfaction_level last_evaluation number_project
## satisfaction_level 1.00000000 0.105021214 -0.142969586
## last_evaluation 0.10502121 1.000000000 0.349332589
## number_project -0.14296959 0.349332589 1.000000000
## average_montly_hours -0.02004811 0.339741800 0.417210634
## exp_in_company -0.10086607 0.131590722 0.196785891
## Work_accident 0.05869724 -0.007104289 -0.004740548
## left -0.38837498 0.006567120 0.023787185
## promotion_last_5years 0.02560519 -0.008683768 -0.006063958
## average_montly_hours exp_in_company Work_accident
## satisfaction_level -0.020048113 -0.100866073 0.058697241
## last_evaluation 0.339741800 0.131590722 -0.007104289
## number_project 0.417210634 0.196785891 -0.004740548
## average_montly_hours 1.000000000 0.127754910 -0.010142888
## exp_in_company 0.127754910 1.000000000 0.002120418
## Work_accident -0.010142888 0.002120418 1.000000000
## left 0.071287179 0.144822175 -0.154621634
## promotion_last_5years -0.003544414 0.067432925 0.039245435
## left promotion_last_5years
## satisfaction_level -0.38837498 0.025605186
## last_evaluation 0.00656712 -0.008683768
## number_project 0.02378719 -0.006063958
## average_montly_hours 0.07128718 -0.003544414
## exp_in_company 0.14482217 0.067432925
## Work_accident -0.15462163 0.039245435
## left 1.00000000 -0.061788107
## promotion_last_5years -0.06178811 1.000000000
trans<-cor(cor_vars)
melted_cormat <- melt(trans)
ggplot(data = melted_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile() +theme(axis.text.x = element_text(angle = 90, hjust = 1))
One-Sample T-Test (Measuring Satisfaction Level) A one-sample t-test checks whether a sample mean differs from the population mean. Let’s test to see whether the average satisfaction level of employees that had a turnover differs from the entire employee population.
Hypothesis Testing: Is there significant difference in the means of satisfaction level between employees who had a turnover and the entire employee population?
Null Hypothesis: (H0: pTS = pES) The null hypothesis would be that there is no difference in satisfaction level between employees who did turnover and the entire employee population.
Alternate Hypothesis: (HA: pTS != pES) The alternative hypothesis would be that there is a difference in satisfaction level between employees who did turnover and the entire employee population.
# Let's compare the means of our employee turnover satisfaction against the employee population satisfaction
emp_population_satisfaction <-mean(data_set$satisfaction_level)
left_pop<-subset(data_set,left==1)
emp_turnover_satisfaction <-mean(left_pop$satisfaction_level)
print( c('The mean for the employee population is: ', emp_population_satisfaction) )
## [1] "The mean for the employee population is: "
## [2] "0.612833522234816"
print( c('The mean for the employees that had a turnover is: ' ,emp_turnover_satisfaction) )
## [1] "The mean for the employees that had a turnover is: "
## [2] "0.440098011761411"
Let’s conduct a t-test at 95% confidence level and see if it correctly rejects the null hypothesis that the sample comes from the same distribution as the employee population. To conduct a one sample t-test, we can use the stats.ttest_1samp() function:
t.test(left_pop$satisfaction_level,mu=emp_population_satisfaction) # Employee Population satisfaction mean
##
## One Sample t-test
##
## data: left_pop$satisfaction_level
## t = -39.109, df = 3570, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0.6128335
## 95 percent confidence interval:
## 0.4314385 0.4487576
## sample estimates:
## mean of x
## 0.440098
The test result shows the test statistic “t” is equal to -39.109. This test statistic tells us how much the sample mean deviates from the null hypothesis. If the t-statistic lies outside the quantiles of the t-distribution corresponding to our confidence level and degrees of freedom, we reject the null hypothesis. We can check the quantiles with stats.t.ppf():
If the t-statistic value we calculated above (-39.109) is outside the quantiles, then we can reject the null hypothesis
#degress of freedom
dof<-sum(as.numeric(data_set$left))
LQ <-qt(0.025,dof) # Left Quartile
RQ <-qt(0.975,dof) # Right Quartile
print (c('The t-distribution left quartile range is: ',LQ))
## [1] "The t-distribution left quartile range is: "
## [2] "-1.96062852159556"
print (c('The t-distribution right quartile range is: ' ,RQ))
## [1] "The t-distribution right quartile range is: "
## [2] "1.96062852159556"
T-Test = -39.109 | P-Value = 9.01e-279 | Reject Null Hypothesis Reject the null hypothesis because:
Based on the statistical analysis of a one sample t-test, there seems to be some significant difference between the mean satisfaction of employees who had a turnover and the entire employee population. The super low P-value of 9.012e-279 at a 5% confidence level is a good indicator to reject the null hypothesis.
But this does not neccessarily mean that there is practical significance. We would have to conduct more experiments or maybe collect more data about the employees in order to come up with a more accurate finding.
Summary: Let’s examine the distribution on some of the employee’s features. Here’s what I found:
Satisfaction - There is a huge spike for employees with low satisfaction and high satisfaction.
Evaluation - There is a bimodal distrubtion of employees for low evaluations (less than 0.6) and high evaluations (more than 0.8)
AverageMonthlyHours - There is another bimodal distribution of employees with lower and higher average monthly hours (less than 150 hours & more than 250 hours)
If you look back at the correlation matrix, the high correlation between evaluation and averageMonthlyHours does support this finding.
Stop and Think:
Is there a reason for the high spike in low satisfaction of employees? Could employees be grouped in a way with these features? Is there a correlation between evaluation and averageMonthlyHours?
par(mfrow=c(1,3))
hist(data_set$satisfaction_level, col="green")
hist(data_set$last_evaluation, col="red")
hist(data_set$average_montly_hours, col="blue")
Summary: This is not unusual. Here’s what I found:
Stop and Think:
vis_1<-table(data_set$salary,data_set$left)
#print(vis_1)
d_vis_1<-as.data.frame(vis_1)
print(d_vis_1)
## Var1 Var2 Freq
## 1 high 0 1155
## 2 low 0 5144
## 3 medium 0 5129
## 4 high 1 82
## 5 low 1 2172
## 6 medium 1 1317
library(ggplot2)
p<-ggplot(d_vis_1, aes(x=Var1,y=Freq,fill=Var2)) +
geom_bar(position="dodge",stat='identity') + coord_flip()
print(p)
Summary: Let’s see more information about the departments. Here’s what I found:
Stop and Think:
If we had more information on each department, can we pinpoint a more direct cause for employee turnover?
vis_2<-table(data_set$role,data_set$left)
d_vis_2<-as.data.frame(vis_2)
d_vis_2<-subset(d_vis_2,Var2==1)
#print(d_vis_2)
library(ggplot2)
d_vis_2$Var1 <- factor(d_vis_2$Var1, levels = d_vis_2$Var1[order(-d_vis_2$Freq)])
p<-ggplot(d_vis_2, aes(x=Var1,y=Freq,fill=Var1)) +
geom_bar(stat='identity') +theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(p)
Summary: This graph is quite interesting as well. Here’s what I found:
There is an increase in employee turnover rate as project count increases
Stop and Think:
vis_3<-table(data_set$number_project,data_set$left)
d_vis_3<-as.data.frame(vis_3)
#print(d_vis_1)
library(ggplot2)
p<-ggplot(d_vis_3, aes(x=Var1,y=Freq,fill=Var2)) +
geom_bar(position="dodge",stat='identity') + coord_flip()
print(p)
Summary:
# Kernel Density Plot
left_data<-subset(data_set,left==1)
stay_data<-subset(data_set,left==0)
ggplot() + geom_density(aes(x=last_evaluation), colour="red", data=left_data) +
geom_density(aes(x=last_evaluation), colour="blue", data=stay_data)
Summary:
#KDEPlot: Kernel Density Estimate Plot
ggplot() + geom_density(aes(x=average_montly_hours), colour="red", data=left_data) +
geom_density(aes(x=average_montly_hours), colour="blue", data=stay_data)
Summary:
#KDEPlot: Kernel Density Estimate Plot
ggplot() + geom_density(aes(x=satisfaction_level), colour="red", data=left_data) +
geom_density(aes(x=satisfaction_level), colour="blue", data=stay_data)
Summary:
Stop and Think:
#ProjectCount VS AverageMonthlyHours [BOXPLOT]
#Looks like the average employees who stayed worked about 200hours/month. Those that had a turnover worked about 250hours/month and 150hours/month
library(ggplot2)
p<-ggplot(data_set, aes(x = factor(number_project), y = average_montly_hours, fill = factor(left))) +
geom_boxplot() + scale_fill_manual(values = c("yellow", "orange"))
print(p)
Summary: This graph looks very similar to the graph above. What I find strange with this graph is with the turnover group. There is an increase in evaluation for employees who did more projects within the turnover group. But, again for the non-turnover group, employees here had a consistent evaluation score despite the increase in project counts.
*Questions to think about:*
#ProjectCount VS Evaluation
#Looks like employees who did not leave the company had an average evaluation of around 70% even with different projectCounts
#There is a huge skew in employees who had a turnover though. It drastically changes after 3 projectCounts.
#Employees that had two projects and a horrible evaluation left. Employees with more than 3 projects and super high evaluations left
p<-ggplot(data_set, aes(x = factor(number_project), y = last_evaluation, fill = factor(left))) +
geom_boxplot() + scale_fill_manual(values = c("yellow", "orange"))
print(p)
Summary: This is by far the most compelling graph. This is what I found:
*There are 3 distinct clusters for employees who left the company
Cluster 1 (Hard-working and Sad Employee): Satisfaction was below 0.2 and evaluations were greater than 0.75. Which could be a good indication that employees who left the company were good workers but felt horrible at their job.
Cluster 2 (Bad and Sad Employee): Satisfaction between about 0.35~0.45 and evaluations below ~0.58. This could be seen as employees who were badly evaluated and felt bad at work.
Cluster 3 (Hard-working and Happy Employee): Satisfaction between 0.7~1.0 and evaluations were greater than 0.8. Which could mean that employees in this cluster were “ideal”. They loved their work and were evaluated highly for their performance.
library(ggplot2)
ggplot(data_set, aes(satisfaction_level, last_evaluation, color = left)) +
geom_point(shape = 16, size = 5, show.legend = FALSE) +
theme_minimal() +
scale_color_gradient(low = "#0091ff", high = "#f0650e")
Boruta is a feature selection algorithm. Precisely, it works as a wrapper algorithm around Random Forest. This package derive its name from a demon in Slavic mythology who dwelled in pine forests. Feature selection is a crucial step in predictive modeling. This technique achieves supreme importance when a data set comprised of several variables is given for model building.
Boruta can be your algorithm of choice to deal with such data sets. Particularly when one is interested in understanding the mechanisms related to the variable of interest, rather than just building a black box predictive model with good prediction accuracy.
How does it work?
Below is the step wise working of boruta algorithm:
Firstly, it adds randomness to the given data set by creating shuffled copies of all features (which are called shadow features). Then, it trains a random forest classifier on the extended data set and applies a feature importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each feature where higher means more important.
At every iteration, it checks whether a real feature has a higher importance than the best of its shadow features (i.e. whether the feature has a higher Z score than the maximum Z score of its shadow features) and constantly removes features which are deemed highly unimportant.
Finally, the algorithm stops either when all features gets confirmed or rejected or it reaches a specified limit of random forest runs.
library(Boruta)
## Loading required package: ranger
data_set$left<-as.factor(data_set$left)
boruta.train <- Boruta(left~., data = data_set, doTrace = 2)
## 1. run of importance source...
## 2. run of importance source...
## 3. run of importance source...
## 4. run of importance source...
## 5. run of importance source...
## 6. run of importance source...
## 7. run of importance source...
## 8. run of importance source...
## 9. run of importance source...
## 10. run of importance source...
## After 10 iterations, +1.9 mins:
## confirmed 9 attributes: average_montly_hours, exp_in_company, last_evaluation, number_project, promotion_last_5years and 4 more;
## no more attributes left.
print(boruta.train)
## Boruta performed 10 iterations in 1.940593 mins.
## 9 attributes confirmed important: average_montly_hours,
## exp_in_company, last_evaluation, number_project,
## promotion_last_5years and 4 more;
## No attributes deemed unimportant.
plot(boruta.train, xlab = "", xaxt = "n")
lz<-lapply(1:ncol(boruta.train$ImpHistory),function(i)
boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]),i])
names(lz) <- colnames(boruta.train$ImpHistory)
Labels <- sort(sapply(lz,median))
axis(side = 1,las=2,labels = names(Labels),
at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7)
#library(Boruta)
#train$left<-as.factor(train$left)
#boruta.train <- Boruta(left~., data = train, doTrace = 2)
#print(boruta.train)
#plot(boruta.train, xlab = "", xaxt = "n")
#lz<-lapply(1:ncol(boruta.train$ImpHistory),function(i)
#boruta.train$ImpHistory[is.finite(boruta.train$ImpHistory[,i]),i])
#names(lz) <- colnames(boruta.train$ImpHistory)
#Labels <- sort(sapply(lz,median))
#axis(side = 1,las=2,labels = names(Labels),
#at = 1:ncol(boruta.train$ImpHistory), cex.axis = 0.7)
Key Observations: The above graph clearly represents the factors which serve as the top reasons for attrition in a company:
Satisfaction level: it already had a negative corellation with the outcome. People with low satisfaction were most likely to leave even when compared with evaluations(Evident cluster was formed with respect to low satisfaction)
Salary and the role they played has one of the least impact on attrition
Pressure due to the number of projects and how they were evaluated also holds key significance in determining attrition
Logistic Regression commonly deals with the issue of how likely an observation is to belong to each group. This model is commonly used to predict the likelihood of an event occurring. In contrast to linear regression, the output of logistic regression is transformed with a logit function. This makes the output either 0 or 1. This is a useful model to take advantage of for this problem because we are interested in predicting whether an employee will leave (0) or stay (1).
Another reason for why logistic regression is the preferred model of choice is because of its interpretability. Logistic regression predicts the outcome of the response variable (turnover) through a set of other explanatory variables, also called predictors. In context of this domain, the value of our response variable is categorized into two forms: 0 (zero) or 1 (one). The value of 0 (zero) represents the probability of an employee not leaving the company and the value of 1 (one) represents the probability of an employee leaving the company.
The equation above shows the relationship between, the dependent variable (success), denoted as (θ) and independent variables or predictor of event, denoted as xi. Where α is the constant of the equation and, β is the coefficient of the predictor variables
#Creating training and test sets for the logistic regression
smp_size <- floor(0.75 * nrow(data_set))
## set the seed to make your partition reproductible
set.seed(123)
train_ind <- sample(seq_len(nrow(data_set)), size = smp_size)
train <- data_set[train_ind, ]
test <- data_set[-train_ind, ]
dim(test)
## [1] 3750 10
dim(train)
## [1] 11249 10
library(gmodels)
library (Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:plyr':
##
## is.discrete, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library (caTools)
library (ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
logit_model<-glm(left~satisfaction_level+last_evaluation+average_montly_hours+salary+role+number_project,data=train,binomial())
summary(logit_model)
##
## Call:
## glm(formula = left ~ satisfaction_level + last_evaluation + average_montly_hours +
## salary + role + number_project, family = binomial(), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8701 -0.6965 -0.4571 -0.1801 2.8414
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7889072 0.2090275 -3.774 0.000161 ***
## satisfaction_level -4.1663664 0.1108697 -37.579 < 2e-16 ***
## last_evaluation 0.8050924 0.1653468 4.869 1.12e-06 ***
## average_montly_hours 0.0048773 0.0005781 8.437 < 2e-16 ***
## salarylow 1.7702386 0.1401465 12.631 < 2e-16 ***
## salarymedium 1.2773270 0.1414040 9.033 < 2e-16 ***
## rolehr 0.0576180 0.1460020 0.395 0.693110
## roleIT -0.2350453 0.1353532 -1.737 0.082470 .
## rolemanagement -0.3395424 0.1745319 -1.945 0.051721 .
## rolemarketing -0.0444521 0.1451552 -0.306 0.759423
## roleproduct_mng -0.2174759 0.1456506 -1.493 0.135402
## roleRandD -0.6158662 0.1599300 -3.851 0.000118 ***
## rolesales -0.1107893 0.1127533 -0.983 0.325814
## rolesupport -0.0168767 0.1202623 -0.140 0.888398
## roletechnical -0.0038007 0.1173905 -0.032 0.974172
## number_project -0.2601256 0.0236473 -11.000 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12345 on 11248 degrees of freedom
## Residual deviance: 10184 on 11233 degrees of freedom
## AIC: 10216
##
## Number of Fisher Scoring iterations: 5
test$logit_model<-predict(logit_model,test)
#head(test)
colAUC(test$logit_model,test$left, plotROC=TRUE)
## [,1]
## 0 vs. 1 0.7837071
#An approach to identify the cut-off for the predicted probabilities
#is to use a binned table of the proababilities. See the exact threshold
#where 0's and 1's are getting classified with high precision and recall
#you can use the two commented lines below to get the threshold manually
#test$logit_model_bin <- cut2(test$logit_model,g=12)
#CrossTable(test$left, test$logit_model_bin,prop.chisq=FALSE,prop.r=FALSE,prop.t=FALSE)
#Now using that threshold created the predicted values for each record
test$prediction<-ifelse(test$logit_model>=-.95,1,0)
#Make use of the confusion matrix to calculate accuracy, precision and recall
#CrossTable(test$left, test$prediction,prop.chisq=FALSE,prop.r=FALSE,prop.t=FALSE)
conf_mat<-table(test$left,test$prediction)
#print(conf_mat)
#class(conf_mat)
accuracy<-(conf_mat[1,1]+conf_mat[2,2])/(conf_mat[1,1]+conf_mat[2,2]+conf_mat[1,2]+conf_mat[2,1])
recall<-(conf_mat[2,2])/(conf_mat[1,2]+conf_mat[2,2])
precision<-(conf_mat[2,2])/(conf_mat[2,2]+conf_mat[2,1])
print(c("Accuracy:",accuracy))
## [1] "Accuracy:" "0.7616"
print(c("Precision:",precision))
## [1] "Precision:" "0.695749440715884"
print(c("Recall:",recall))
## [1] "Recall:" "0.5"
#red <- prediction(test$prediction, test$left)
#P.perf <- performance(pred, "prec", "rec")
#lot (RP.perf)
The best model performance out of the four (Decision Tree Model, AdaBoost Model, Logistic Regression Model, Random Forest Model) was Random Forest!
Note: Base Rate
A Base Rate Model is a model that always selects the target variable’s majority class. It’s just used for reference to compare how better another model is against it. In this dataset, the majority class that will be predicted will be 0’s, which are employees who did not leave the company.
If you recall back to Part 3: Exploring the Data, 24% of the dataset contained 1’s (employee who left the company) and the remaining 76% contained 0’s (employee who did not leave the company). The Base Rate Model would simply predict every 0’s and ignore all the 1’s.
Example: The base rate accuracy for this data set, when classifying everything as 0’s, would be 76% because 76% of the dataset are labeled as 0’s (employees not leaving the company).
Note: Different Ways to Evaluate Classification Models
Caption for the picture.
# load the library
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
# load the iris dataset
# define training control
train_control <- trainControl(method="cv", number=10)
train$left<-as.factor(train$left)
# fix the parameters of the algorithm
grid <- expand.grid()
# train the model
model <- train(left~., data=train, trControl=train_control, method="glm",family=binomial())
#model <- train(left~satisfaction_level+last_evaluation+number_project+exp_in_company+average_montly_hours, data=train, trControl=train_control, method="glm",family=binomial())
# summarize results
print(model)
## Generalized Linear Model
##
## 11249 samples
## 9 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 10125, 10124, 10124, 10124, 10124, 10125, ...
## Resampling results:
##
## Accuracy Kappa
## 0.7884257 0.3188393
pred_log<-predict(model, newdata = test, type = "prob")
pred_log_c<-predict(model, newdata = test, type = "raw")
test_log_model<-pred_log[,2]
test_log_model_1<-data.frame(test_log_model)
colAUC(test_log_model_1$test_log_model,test$left, plotROC=TRUE)
## [,1]
## 0 vs. 1 0.8202086
conf_mat<-table(as.factor(pred_log_c),as.factor(test$left))
print(conf_mat)
##
## 0 1
## 0 2665 574
## 1 191 320
#class(conf_mat)
accuracy<-(conf_mat[1,1]+conf_mat[2,2])/(conf_mat[1,1]+conf_mat[2,2]+conf_mat[1,2]+conf_mat[2,1])
recall<-(conf_mat[2,2])/(conf_mat[1,2]+conf_mat[2,2])
precision<-(conf_mat[2,2])/(conf_mat[2,2]+conf_mat[2,1])
print(c("Accuracy:",accuracy))
## [1] "Accuracy:" "0.796"
print(c("Precision:",precision))
## [1] "Precision:" "0.626223091976517"
print(c("Recall:",recall))
## [1] "Recall:" "0.357941834451902"
# NOTE: By adding in "class_weight = balanced", the Logistic Auc increased by about 10%! This adjusts the threshold value
# Decision Tree Model
library(rpart.plot)
## Loading required package: rpart
##
## Attaching package: 'rpart'
## The following object is masked from 'package:survival':
##
## solder
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
train$left<-as.factor(train$left)
dtree_fit <- train(left ~., data = train, method = "rpart",
parms = list(split = "information"),
trControl=trctrl,
tuneLength = 10)
#plot decision tree
p<-prp(dtree_fit$finalModel, box.palette = "Reds", tweak = 1.2)
print(p)
## $obj
## n= 11249
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 11249 2677 0 (0.76202329 0.23797671)
## 2) satisfaction_level>=0.465 8118 785 0 (0.90330131 0.09669869)
## 4) exp_in_company< 4.5 6629 99 0 (0.98506562 0.01493438) *
## 5) exp_in_company>=4.5 1489 686 0 (0.53928811 0.46071189)
## 10) last_evaluation< 0.805 574 24 0 (0.95818815 0.04181185) *
## 11) last_evaluation>=0.805 915 253 1 (0.27650273 0.72349727)
## 22) exp_in_company>=6.5 126 0 0 (1.00000000 0.00000000) *
## 23) exp_in_company< 6.5 789 127 1 (0.16096324 0.83903676)
## 46) average_montly_hours< 214 85 14 0 (0.83529412 0.16470588) *
## 47) average_montly_hours>=214 704 56 1 (0.07954545 0.92045455)
## 94) satisfaction_level< 0.705 35 7 0 (0.80000000 0.20000000) *
## 95) satisfaction_level>=0.705 669 28 1 (0.04185351 0.95814649) *
## 3) satisfaction_level< 0.465 3131 1239 1 (0.39572022 0.60427978)
## 6) number_project>=2.5 1847 755 0 (0.59122902 0.40877098)
## 12) satisfaction_level>=0.115 1183 91 0 (0.92307692 0.07692308) *
## 13) satisfaction_level< 0.115 664 0 1 (0.00000000 1.00000000) *
## 7) number_project< 2.5 1284 147 1 (0.11448598 0.88551402)
## 14) last_evaluation>=0.575 87 5 0 (0.94252874 0.05747126) *
## 15) last_evaluation< 0.575 1197 65 1 (0.05430242 0.94569758)
## 30) average_montly_hours>=162 36 6 0 (0.83333333 0.16666667) *
## 31) average_montly_hours< 162 1161 35 1 (0.03014643 0.96985357)
## 62) average_montly_hours< 125.5 17 0 0 (1.00000000 0.00000000) *
## 63) average_montly_hours>=125.5 1144 18 1 (0.01573427 0.98426573) *
##
## $snipped.nodes
## NULL
##
## $xlim
## [1] 0 1
##
## $ylim
## [1] 0 1
##
## $x
## [1] 0.36494386 0.10121162 0.02625852 0.17616472 0.10121837 0.25111108
## [7] 0.17623218 0.32598997 0.25178566 0.40019428 0.33273582 0.46765275
## [13] 0.62867610 0.46846225 0.40100378 0.53592071 0.74571654 0.67083764
## [19] 0.82059543 0.74639112 0.89479974 0.82734128 0.96225821
##
## $y
## [1] 0.95182258 0.79547965 0.63913671 0.63913671 0.48279378 0.48279378
## [7] 0.32645084 0.32645084 0.17010791 0.17010791 0.01376498 0.01376498
## [13] 0.79547965 0.63913671 0.48279378 0.48279378 0.63913671 0.48279378
## [19] 0.48279378 0.32645084 0.32645084 0.17010791 0.17010791
##
## $branch.x
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## x 0.3649439 0.1012116 0.02625852 0.1761647 0.1012184 0.2511111 0.1762322
## NA 0.3121974 0.08622100 0.1162022 0.1611755 0.1911540 0.2361353
## NA 0.3649439 0.10121162 0.1012116 0.1761647 0.1761647 0.2511111
## [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## x 0.3259900 0.2517857 0.4001943 0.3327358 0.4676527 0.6286761 0.4684622
## 0.2660869 0.3111491 0.3408308 0.3867026 0.4136860 0.4176903 0.5793640
## 0.2511111 0.3259900 0.3259900 0.4001943 0.4001943 0.3649439 0.6286761
## [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## x 0.4010038 0.5359207 0.7457165 0.6708376 0.8205954 0.7463911 0.8947997
## 0.4549706 0.4819539 0.6348148 0.7307408 0.7606923 0.8057546 0.8354363
## 0.4684622 0.4684622 0.6286761 0.7457165 0.7457165 0.8205954 0.8205954
## [,22] [,23]
## x 0.8273413 0.9622582
## 0.8813081 0.9082914
## 0.8947997 0.8947997
##
## $branch.y
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## y 0.9575063 0.8011633 0.6904954 0.6448204 0.5341525 0.4884775 0.3778095
## NA 0.9518226 0.7954796 0.7954796 0.6391367 0.6391367 0.4827938
## NA 0.9518226 0.7954796 0.7954796 0.6391367 0.6391367 0.4827938
## [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## y 0.3321345 0.2214666 0.1757916 0.06512368 0.06512368 0.8011633 0.6448204
## 0.4827938 0.3264508 0.3264508 0.17010791 0.17010791 0.9518226 0.7954796
## 0.4827938 0.3264508 0.3264508 0.17010791 0.17010791 0.9518226 0.7954796
## [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## y 0.5341525 0.5341525 0.6448204 0.5341525 0.4884775 0.3778095 0.3321345
## 0.6391367 0.6391367 0.7954796 0.6391367 0.6391367 0.4827938 0.4827938
## 0.6391367 0.6391367 0.7954796 0.6391367 0.6391367 0.4827938 0.4827938
## [,22] [,23]
## y 0.2214666 0.2214666
## 0.3264508 0.3264508
## 0.3264508 0.3264508
##
## $labs
## [1] NA NA "0" NA "0" NA "0" NA "0" NA "0" "1" NA NA "0" "1" NA
## [18] "0" NA "0" NA "0" "1"
##
## $cex
## [1] 1.2
##
## $boxes
## $boxes$x1
## [1] NA NA 0.004099429 NA 0.079059275
## [6] NA 0.154073087 NA 0.229626567 NA
## [11] 0.310576725 0.445493654 NA NA 0.378844691
## [16] 0.513761620 NA 0.648678548 NA 0.724232029
## [21] NA 0.805182186 0.940099115
##
## $boxes$y1
## [1] NA NA 0.62201715 NA 0.46567421
## [6] NA 0.30933128 NA 0.15298834 NA
## [11] -0.00335459 -0.00335459 NA NA 0.46567421
## [16] 0.46567421 NA 0.46567421 NA 0.30933128
## [21] NA 0.15298834 0.15298834
##
## $boxes$x2
## [1] NA NA 0.04841761 NA 0.12337746 NA
## [7] 0.19839127 NA 0.27394475 NA 0.35489491 0.48981184
## [13] NA NA 0.42316288 0.55807980 NA 0.69299673
## [19] NA 0.76855021 NA 0.84950037 0.98441730
##
## $boxes$y2
## [1] NA NA 0.69049541 NA 0.53415248 NA
## [7] 0.37780955 NA 0.22146661 NA 0.06512368 0.06512368
## [13] NA NA 0.53415248 0.53415248 NA 0.53415248
## [19] NA 0.37780955 NA 0.22146661 0.22146661
##
##
## $split.labs
## [1] ""
##
## $split.cex
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## $split.box
## $split.box$x1
## [1] 0.23071090 -0.02322021 NA 0.04789767 NA
## [6] 0.11730425 NA 0.19644451 NA 0.28428518
## [11] NA NA 0.48549427 0.33422928 NA
## [16] NA 0.60807448 NA 0.68167497 NA
## [21] 0.76525428 NA NA
##
## $split.box$y1
## [1] 0.9347030 0.7783601 NA 0.6220171 NA 0.4656742 NA
## [8] 0.3093313 NA 0.1529883 NA NA 0.7783601 0.6220171
## [15] NA NA 0.6220171 NA 0.4656742 NA 0.3093313
## [22] NA NA
##
## $split.box$x2
## [1] 0.4991768 0.2256434 NA 0.3044318 NA 0.3849179 NA
## [8] 0.4555354 NA 0.5161034 NA NA 0.7718579 0.6026952
## [15] NA NA 0.8833586 NA 0.9595159 NA 1.0243452
## [22] NA NA
##
## $split.box$y2
## [1] 1.0031813 0.8468383 NA 0.6904954 NA 0.5341525 NA
## [8] 0.3778095 NA 0.2214666 NA NA 0.8468383 0.6904954
## [15] NA NA 0.6904954 NA 0.5341525 NA 0.3778095
## [22] NA NA
pred_d3<-predict(dtree_fit, newdata = test, type = "prob")
pred_d3_c<-predict(dtree_fit, newdata = test, type = "raw")
test_rpart_model<-pred_d3[,2]
test_rpart_model_1<-data.frame(test_rpart_model)
str(test_rpart_model_1)
## 'data.frame': 3750 obs. of 1 variable:
## $ test_rpart_model: num 0.958 1 0.958 0.984 1 ...
colAUC(test_rpart_model_1$test_rpart_model,test$left, plotROC=TRUE)
## [,1]
## 0 vs. 1 0.9742972
conf_mat<-table(as.factor(pred_d3_c),as.factor(test$left))
print(conf_mat)
##
## 0 1
## 0 2831 69
## 1 25 825
#class(conf_mat)
accuracy<-(conf_mat[1,1]+conf_mat[2,2])/(conf_mat[1,1]+conf_mat[2,2]+conf_mat[1,2]+conf_mat[2,1])
recall<-(conf_mat[2,2])/(conf_mat[1,2]+conf_mat[2,2])
precision<-(conf_mat[2,2])/(conf_mat[2,2]+conf_mat[2,1])
print(c("Accuracy:",accuracy))
## [1] "Accuracy:" "0.974933333333333"
print(c("Precision:",precision))
## [1] "Precision:" "0.970588235294118"
print(c("Recall:",recall))
## [1] "Recall:" "0.922818791946309"
# Random Forest Model
library(rpart.plot)
library(caret)
train$left<-as.factor(train$left)
#print("uncomment code and run. It generally kills the kernel because of the time taken by Random forest")
#ctrl = trainControl(method="repeatedcv", number=10, repeats=5, selectionFunction ="oneSE")
#rf_model<-train(left~.,data=train,method="rf",trControl=ctrl,prox=TRUE,allowParallel=TRUE)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ranger':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
train$left<-as.factor(train$left)
rf_model <- randomForest(left~.,data=train)
print("random forest")
## [1] "random forest"
# View the forest results.
print(rf_model)
##
## Call:
## randomForest(formula = left ~ ., data = train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 0.86%
## Confusion matrix:
## 0 1 class.error
## 0 8561 11 0.001283248
## 1 86 2591 0.032125514
# Importance of each predictor.
print(importance(rf_model,type = 2))
## MeanDecreaseGini
## satisfaction_level 1344.528758
## last_evaluation 488.942702
## number_project 745.018203
## average_montly_hours 586.455485
## exp_in_company 768.389375
## Work_accident 22.615682
## promotion_last_5years 4.144949
## role 64.040424
## salary 32.328103
varImpPlot(rf_model)
#Prediction(test)
test$rf_model<-predict(rf_model,test, type="prob")
test$rf_model_c<-predict(rf_model,test, type="response")
test_rf_model<-test$rf_model[,2]
test_rf_model_1<-data.frame(test_rf_model)
colAUC(test_rf_model_1$test_rf_model,test$left, plotROC=TRUE)
## [,1]
## 0 vs. 1 0.9964115
conf_mat<-table(as.factor(test$rf_model_c),as.factor(test$left))
#print(conf_mat)
#class(conf_mat)
accuracy<-(conf_mat[1,1]+conf_mat[2,2])/(conf_mat[1,1]+conf_mat[2,2]+conf_mat[1,2]+conf_mat[2,1])
recall<-(conf_mat[2,2])/(conf_mat[1,2]+conf_mat[2,2])
precision<-(conf_mat[2,2])/(conf_mat[2,2]+conf_mat[2,1])
print(c("Accuracy:",accuracy))
## [1] "Accuracy:" "0.992533333333333"
print(c("Precision:",precision))
## [1] "Precision:" "0.992045454545455"
print(c("Recall:",recall))
## [1] "Recall:" "0.976510067114094"
library(fastAdaboost)
## Warning: package 'fastAdaboost' was built under R version 3.5.3
mod_adaboost <- adaboost(left~.,data=train, 10)
pred <- predict(mod_adaboost,newdata=test)
print(table(factor(pred$class),factor(test$left)))
##
## 0 1
## 0 2830 20
## 1 26 874
colAUC(pred$prob[,2],factor(test$left), plotROC=TRUE)
## [,1]
## 0 vs. 1 0.9937249
conf_mat<-table(factor(pred$class),factor(test$left))
accuracy<-(conf_mat[1,1]+conf_mat[2,2])/(conf_mat[1,1]+conf_mat[2,2]+conf_mat[1,2]+conf_mat[2,1])
recall<-(conf_mat[2,2])/(conf_mat[1,2]+conf_mat[2,2])
precision<-(conf_mat[2,2])/(conf_mat[2,2]+conf_mat[2,1])
print(c("Accuracy:",accuracy))
## [1] "Accuracy:" "0.987733333333333"
print(c("Precision:",precision))
## [1] "Precision:" "0.971111111111111"
print(c("Recall:",recall))
## [1] "Recall:" "0.977628635346756"
Summary: With all of this information, this is what Bob should know about his company and why his employees probably left:
Employees generally left when they are underworked (less than 150hr/month or 6hr/day)
Employees generally left when they are overworked (more than 250hr/month or 10hr/day)
Employees with either really high or low evaluations should be taken into consideration for high turnover rate
Employees with low to medium salaries are the bulk of employee turnover
Employees that had 2,6, or 7 project count was at risk of leaving the company
Employee satisfaction is the highest indicator for employee turnover.
Employee that had 4 and 5 yearsAtCompany should be taken into consideration for high turnover rate
library(DALEX)
## Warning: package 'DALEX' was built under R version 3.5.3
## Welcome to DALEX (version: 0.3.0).
## This is a plain DALEX. Use 'install_dependencies()' to get all required packages.
library(breakDown)
# Create a DALEX explainer
explainer_rf <- explain(rf_model,data=train)
colnames(test)
## [1] "satisfaction_level" "last_evaluation"
## [3] "number_project" "average_montly_hours"
## [5] "exp_in_company" "Work_accident"
## [7] "left" "promotion_last_5years"
## [9] "role" "salary"
## [11] "logit_model" "prediction"
## [13] "rf_model" "rf_model_c"
# Calculate variable attributions
new_observation <- test[3,-7]
library("breakDown")
bd_rf <- prediction_breakdown(explainer_rf,new_observation)
bd_rf
## variable contribution
## 1 (Intercept) 2.397305e-01
## satisfaction_level + satisfaction_level = 0.89 -1.360821e-01
## average_montly_hours + average_montly_hours = 224 -9.141968e-03
## role + role = sales -5.387679e-03
## promotion_last_5years + promotion_last_5years = 0 7.858476e-05
## Work_accident + Work_accident = 0 8.836341e-05
## salary + salary = low 7.407947e-03
## number_project + number_project = 5 1.538590e-02
## exp_in_company + exp_in_company = 5 3.086594e-01
## last_evaluation + last_evaluation = 1 5.792611e-01
## 11 final_prognosis 1.000000e+00
## variable_name variable_value cummulative
## 1 Intercept 1 0.23973046
## satisfaction_level satisfaction_level 0.89 0.10364832
## average_montly_hours average_montly_hours 224 0.09450636
## role role sales 0.08911868
## promotion_last_5years promotion_last_5years 0 0.08919726
## Work_accident Work_accident 0 0.08928563
## salary salary low 0.09669357
## number_project number_project 5 0.11207947
## exp_in_company exp_in_company 5 0.42073891
## last_evaluation last_evaluation 1 1.00000000
## 11 1.00000000
## sign position label
## 1 1 1 randomForest
## satisfaction_level -1 2 randomForest
## average_montly_hours -1 3 randomForest
## role -1 4 randomForest
## promotion_last_5years 1 5 randomForest
## Work_accident 1 6 randomForest
## salary 1 7 randomForest
## number_project 1 8 randomForest
## exp_in_company 1 9 randomForest
## last_evaluation 1 10 randomForest
## 11 X 11 randomForest
plot(bd_rf)