In this notebook I will be looking at IBM's employee attrition data, obtained from Kaggle: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset
My goal in this notebook is to predict the likelihood of employees leaving the company and to assess the impact this would have on the company.
library(tidyverse)
library(caret)
library(DataExplorer)
setwd("~/HRattrition")
df = read.csv("WA_Fn-UseC_-HR-Employee-Attrition 2.csv")
# These three columns are constant for every employee, so they carry no information
df$Over18 = NULL
df$EmployeeCount = NULL
df$StandardHours = NULL
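If I did not already know which columns were constant, caret's nearZeroVar() could find them programmatically. A minimal sketch, run on the freshly read dataframe (before the deletions above):
# zeroVar flags predictors with a single unique value
zv = nearZeroVar(read.csv("WA_Fn-UseC_-HR-Employee-Attrition 2.csv"), saveMetrics = TRUE)
rownames(zv)[zv$zeroVar]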
There are 1470 observations of 35 variables in the raw dataframe (32 variables remain after dropping the constant columns above). I will use the head() command to display the first few rows.
#taking a quick look at the data
df %>% head()
The plot_histogram() function from DataExplorer is useful for plotting the distributions of all numeric variables in a dataframe at once. One thing to note is that Age is the only variable whose distribution is close to normal.
df %>% plot_histogram()
There are no missing values in the whole dataset. This is helpful, because missing values are problematic when modeling.
df %>% plot_missing()
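The same check as a one-liner, without the plot:
# Total NA count across the whole dataframe -- should be zero
sum(is.na(df))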
A quick correlation matrix of the numeric features highlights that WorkLifeBalance, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion and YearsWithCurrManager contain much of the same information and are highly correlated.
df %>% select_if(is.numeric) %>% plot_correlation(type = "c")
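Beyond eyeballing the heatmap, caret's findCorrelation() can list the predictors involved in high pairwise correlations. A sketch, with 0.75 as an illustrative cutoff (not a tuned value):
num_df = df %>% select_if(is.numeric)
# Indices of predictors recommended for removal due to high pairwise correlation
high_cor = findCorrelation(cor(num_df), cutoff = 0.75)
names(num_df)[high_cor]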
The variable of interest for prediction is Attrition, a two-level factor that indicates whether turnover occurred for the given observation. I plot a simple barplot of the Attrition variable.
df %>% ggplot(aes(Attrition)) +
geom_bar()+
ggtitle("Employee Turnover")
Before fitting predictive models, I check whether the employees fall into natural groups using k-means clustering. fviz_nbclust() from factoextra uses the gap statistic to suggest a number of clusters.
library(factoextra)
# Cluster on the numeric features only, excluding the EmployeeNumber identifier
cluster_matrix = df %>% select_if(is.numeric) %>% select(-EmployeeNumber)
head(cluster_matrix)
fviz_nbclust(cluster_matrix, kmeans, method = "gap_stat")
model = kmeans(cluster_matrix, 2)
model$withinss
[1] 25315831398 25550069498
model$tot.withinss
[1] 50865900897
The raw within-cluster sums of squares are enormous because the features sit on very different scales (monthly figures in the thousands, satisfaction scores from 1 to 4), so the distance calculation is dominated by the large-magnitude variables. I refit k-means on standardized data and visualize the partition.
km.res <- kmeans(scale(cluster_matrix), 2)
h = fviz_cluster(km.res, data = cluster_matrix,
#palette = c("#00AFBB","#2E9FDF"),
geom = "point",
ggtheme = theme_minimal(),
main = "Partitioning Clustering Plot",
ellipse.type = "norm"
)
# Attach cluster labels and compare feature means across the two clusters
cluster_matrix$cluster = km.res$cluster
cluster_matrix %>% group_by(cluster) %>% summarise_all(mean)
h
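To see whether this unsupervised partition lines up with turnover at all, a quick cross-tabulation of cluster membership against Attrition (the rows of cluster_matrix still align with df):
table(cluster = km.res$cluster, attrition = df$Attrition)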
The first model that I will fit is a random forest using the randomForest package. Random forests are my favorite model: they are quite accurate and provide a lot of useful information about the relationship between the response variable and the predictors.
# I import the caret library which I will use throughout this analysis
library(caret)
# I define a 70 - 30 train test split in the data.
split = createDataPartition(df$Attrition, p = 0.7, list = F)
train = df[split,]
test = df[-split,]
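createDataPartition() stratifies the split on the outcome, so the class balance should be nearly identical in the two sets. A quick sanity check:
# Class proportions should match between train and test
prop.table(table(train$Attrition))
prop.table(table(test$Attrition))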
The random forest model is fairly accurate on the held-out test set. It is notable, however, that specificity is quite low: with "No" as the positive class, the model correctly classifies nearly everyone who stays but identifies only a small fraction of the employees who actually leave. This model does not do well at predicting the employees who do end up leaving the company.
library(randomForest)
# I fit a random forest using the default parameters
rf = randomForest::randomForest(Attrition ~ ., data = train)
# For a classification forest, type = "response" returns predicted class labels
pred = predict(rf, test, type = "response")
mean(pred == test$Attrition)
[1] 0.8613636
confusionMatrix(pred, test$Attrition)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 367 59
Yes 2 12
Accuracy : 0.8614
95% CI : (0.8255, 0.8923)
No Information Rate : 0.8386
P-Value [Acc > NIR] : 0.1076
Kappa : 0.2421
Mcnemar's Test P-Value : 7.496e-13
Sensitivity : 0.9946
Specificity : 0.1690
Pos Pred Value : 0.8615
Neg Pred Value : 0.8571
Prevalence : 0.8386
Detection Rate : 0.8341
Detection Prevalence : 0.9682
Balanced Accuracy : 0.5818
'Positive' Class : No
rf
Call:
randomForest(formula = Attrition ~ ., data = train)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5
OOB estimate of error rate: 14.17%
Confusion matrix:
No Yes class.error
No 854 10 0.01157407
Yes 136 30 0.81927711
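The OOB confusion matrix shows the same imbalance problem: an 82% class error for "Yes". One way to trade a little accuracy for sensitivity on leavers is to score the test set with class probabilities and lower the vote threshold for "Yes"; a sketch, where 0.3 is an illustrative choice rather than a tuned value:
# Predicted probability of leaving, from the forest's vote fractions
prob_yes = predict(rf, test, type = "prob")[, "Yes"]
pred_adj = factor(ifelse(prob_yes > 0.3, "Yes", "No"), levels = levels(test$Attrition))
confusionMatrix(pred_adj, test$Attrition)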
According to the random forest model, the monthly income and age of an employee are the most important indicators of turnover.
plot1 = varImpPlot(rf, n.var = 17)
df %>% ggplot(aes(MonthlyIncome, DailyRate, color = Attrition))+
geom_point()+
geom_smooth(method = "lm")
df$Attrition = as.factor(df$Attrition)
library(pdp)
rf %>%
partial(pred.var = "DailyRate") %>%
plotPartial(rug = TRUE, train = train, main = "Daily Rate Partial Dependence Plot")
rf %>%
partial(pred.var = "Age") %>%
plotPartial(rug = TRUE, train = train, main = "Age Partial Dependence Plot")
rf %>%
partial(pred.var = "OverTime") %>%
plotPartial(rug = TRUE, train = train)
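A two-variable partial dependence surface can also reveal interactions. A sketch for the two predictors the forest ranked highest (two-way PDPs are slow for a 500-tree forest, so grid.resolution keeps the grid coarse):
# Joint partial dependence of predicted attrition on Age and MonthlyIncome
rf %>%
partial(pred.var = c("Age", "MonthlyIncome"), grid.resolution = 20) %>%
plotPartial(main = "Age x MonthlyIncome Partial Dependence")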
set.seed(2017)
ctrl <- trainControl(method = "repeatedcv", repeats = 3)
regFit <- train(
Attrition ~ .,
data = train %>% drop_na(),
trControl = ctrl,
method = "glm",
#family = "binomial",
## Center and scale the predictors for the training
## set and all future samples.
preProc = c("center", "scale"),
tuneLength = 10
)
pred = predict(regFit, test, type = "prob")
pred2 = predict(regFit, test)
prop.table(confusionMatrix(pred2, test$Attrition)$table)
Reference
Prediction No Yes
No 0.80454545 0.09318182
Yes 0.03409091 0.06818182
Compared with the random forest, the logistic regression catches a noticeably larger share of the true leavers (6.8% of the test set versus the forest's 2.7%), at the cost of more false alarms. I plot its variable importance next.
plot(varImp(regFit), top = 15, main = "GLM variable importance", ylab = "Predictor")
summary(regFit$finalModel)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8441 -0.4763 -0.2460 -0.0854 3.5209
Coefficients:
Estimate Std. Error z value
(Intercept) -2.813e+00 6.010e+00 -0.468
Age -3.688e-01 1.498e-01 -2.461
BusinessTravelTravel_Frequently 6.262e-01 1.821e-01 3.439
BusinessTravelTravel_Rarely 3.275e-01 1.931e-01 1.696
DailyRate 7.271e-04 1.074e-01 0.007
`DepartmentResearch & Development` 6.409e+00 3.271e+02 0.020
DepartmentSales 6.256e+00 3.169e+02 0.020
DistanceFromHome 3.281e-01 1.065e-01 3.082
Education 7.411e-03 1.091e-01 0.068
`EducationFieldLife Sciences` -2.997e-01 4.943e-01 -0.606
EducationFieldMarketing -5.484e-02 3.323e-01 -0.165
EducationFieldMedical -1.805e-01 4.702e-01 -0.384
EducationFieldOther -4.489e-02 2.531e-01 -0.177
`EducationFieldTechnical Degree` 1.795e-01 2.911e-01 0.616
EmployeeNumber -1.339e-01 1.136e-01 -1.179
EnvironmentSatisfaction -5.392e-01 1.121e-01 -4.809
GenderMale 2.070e-01 1.121e-01 1.846
HourlyRate -1.428e-02 1.100e-01 -0.130
JobInvolvement -3.755e-01 1.074e-01 -3.496
JobLevel 9.978e-02 4.195e-01 0.238
`JobRoleHuman Resources` 2.691e+00 1.193e+02 0.023
`JobRoleLaboratory Technician` 6.369e-01 2.397e-01 2.657
JobRoleManager 1.817e-01 2.664e-01 0.682
`JobRoleManufacturing Director` -1.025e-02 2.088e-01 -0.049
`JobRoleResearch Director` -6.904e-02 2.481e-01 -0.278
`JobRoleResearch Scientist` 2.773e-01 2.459e-01 1.128
`JobRoleSales Executive` 3.347e-01 5.162e-01 0.648
`JobRoleSales Representative` 5.026e-01 2.914e-01 1.725
JobSatisfaction -3.233e-01 1.075e-01 -3.007
MaritalStatusMarried 1.666e-01 1.712e-01 0.973
MaritalStatusSingle 5.757e-01 2.047e-01 2.812
MonthlyIncome 1.396e-02 4.631e-01 0.030
MonthlyRate 1.288e-01 1.079e-01 1.194
NumCompaniesWorked 5.512e-01 1.170e-01 4.711
OverTimeYes 9.351e-01 1.096e-01 8.530
PercentSalaryHike -8.960e-02 1.759e-01 -0.509
PerformanceRating 6.297e-02 1.756e-01 0.359
RelationshipSatisfaction -2.885e-01 1.098e-01 -2.628
StockOptionLevel -2.570e-01 1.737e-01 -1.480
TotalWorkingYears -5.191e-01 2.848e-01 -1.823
TrainingTimesLastYear -3.192e-01 1.137e-01 -2.807
WorkLifeBalance -1.668e-01 1.064e-01 -1.568
YearsAtCompany 3.640e-01 3.097e-01 1.175
YearsInCurrentRole -4.404e-01 2.120e-01 -2.077
YearsSinceLastPromotion 6.518e-01 1.622e-01 4.018
YearsWithCurrManager -4.532e-01 2.195e-01 -2.064
Pr(>|z|)
(Intercept) 0.639741
Age 0.013855 *
BusinessTravelTravel_Frequently 0.000584 ***
BusinessTravelTravel_Rarely 0.089900 .
DailyRate 0.994599
`DepartmentResearch & Development` 0.984368
DepartmentSales 0.984252
DistanceFromHome 0.002058 **
Education 0.945846
`EducationFieldLife Sciences` 0.544342
EducationFieldMarketing 0.868910
EducationFieldMedical 0.701030
EducationFieldOther 0.859214
`EducationFieldTechnical Degree` 0.537616
EmployeeNumber 0.238585
EnvironmentSatisfaction 1.52e-06 ***
GenderMale 0.064884 .
HourlyRate 0.896657
JobInvolvement 0.000473 ***
JobLevel 0.811971
`JobRoleHuman Resources` 0.982008
`JobRoleLaboratory Technician` 0.007876 **
JobRoleManager 0.495222
`JobRoleManufacturing Director` 0.960830
`JobRoleResearch Director` 0.780842
`JobRoleResearch Scientist` 0.259493
`JobRoleSales Executive` 0.516803
`JobRoleSales Representative` 0.084529 .
JobSatisfaction 0.002640 **
MaritalStatusMarried 0.330395
MaritalStatusSingle 0.004928 **
MonthlyIncome 0.975955
MonthlyRate 0.232598
NumCompaniesWorked 2.47e-06 ***
OverTimeYes < 2e-16 ***
PercentSalaryHike 0.610406
PerformanceRating 0.719831
RelationshipSatisfaction 0.008588 **
StockOptionLevel 0.138934
TotalWorkingYears 0.068321 .
TrainingTimesLastYear 0.004993 **
WorkLifeBalance 0.116872
YearsAtCompany 0.239962
YearsInCurrentRole 0.037809 *
YearsSinceLastPromotion 5.88e-05 ***
YearsWithCurrManager 0.038984 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 909.69 on 1029 degrees of freedom
Residual deviance: 589.07 on 984 degrees of freedom
AIC: 681.07
Number of Fisher Scoring iterations: 15
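Because the predictors were centered and scaled before fitting, each coefficient is on a per-standard-deviation scale, and exponentiating it gives an odds ratio. A quick interpretive sketch of the largest effects (the Department and JobRole dummies with huge standard errors should be read with caution):
# Odds ratio for attrition per one-SD increase in each scaled predictor
head(sort(exp(coef(regFit$finalModel)), decreasing = TRUE), 10)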
Fit <- train(
Attrition ~ .,
data = train %>% drop_na(),
trControl = ctrl,
method = "rf",
#family = "binomial",
## Center and scale the predictors for the training
## set and all future samples.
preProc = c("center", "scale"),
tuneLength = 10
)
confusionMatrix(Fit)
Cross-Validated (10 fold, repeated 3 times) Confusion Matrix
(entries are percentual average cell counts across resamples)
Reference
Prediction No Yes
No 82.8 13.2
Yes 1.1 2.9
Accuracy (average) : 0.8573
Fit
Random Forest
1030 samples
31 predictor
2 classes: 'No', 'Yes'
Pre-processing: centered (45), scaled (45)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 926, 928, 926, 927, 927, 927, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.8485464 0.1032056
6 0.8534358 0.1951312
11 0.8560059 0.2198296
16 0.8560154 0.2333909
21 0.8573162 0.2397396
25 0.8560155 0.2477376
30 0.8537624 0.2386850
35 0.8550288 0.2421730
40 0.8537438 0.2453773
45 0.8502087 0.2229154
Accuracy was used to select the optimal model using the
largest value.
The final value used for the model was mtry = 21.
plot(varImp(Fit), 15)
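Since both caret models were fit with the same trainControl object, their resampling distributions can be compared with resamples(). (For a strictly paired comparison the same resampling indices should be used, e.g. by calling set.seed() with the same value before each train() call; treat this as a sketch.)
# Side-by-side accuracy and kappa across the 30 resamples
res = resamples(list(GLM = regFit, RF = Fit))
summary(res)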
Finally, since OverTime and EnvironmentSatisfaction were both strong predictors in the GLM, I look at how they interact.
df %>% ggplot(aes(EnvironmentSatisfaction, fill = OverTime))+
geom_histogram(bins = 30)